Simulating stretched volume remote instance using a shadow volume on a local system

ABSTRACT

A simulated stretched volume may be configured from multiple volumes of a single data storage system. The volumes may be assigned unique identifiers. The volumes may be exposed to a host over paths from the single data storage system as the same volume having the same unique identifier. The single data storage system may include sets of target ports with each set simulating paths to a different data storage system. A management command may be received that is directed to the simulated stretched volume having the unique identifier. The management command may be received on a path from the host to a target port of the single data storage system. Servicing the management command may include the single data storage system simulating either the local or remote system depending on the set of target ports including the target port.

BACKGROUND

Technical Field

This application generally relates to data storage.

Description of Related Art

Systems may include different resources used by one or more host processors. The resources and the host processors in the system may be interconnected by one or more communication connections, such as network connections. These resources may include data storage devices such as those included in data storage systems. The data storage systems may be coupled to one or more host processors and provide storage services to each host processor. Multiple data storage systems from one or more different vendors may be connected and may provide common data storage for the one or more host processors.

A host may perform a variety of data processing tasks and operations using the data storage system. For example, a host may issue I/O operations, such as data read and write operations, received at a data storage system. The host systems may store and retrieve data by issuing the I/O operations to the data storage system containing a plurality of host interface units, disk drives (or more generally storage devices), and disk interface units. The host systems access the storage devices through a plurality of channels provided therewith. The host systems provide data and access control information through the channels to a storage device of the data storage system. Data stored on the storage device may also be provided from the data storage system to the host systems also through the channels. The host systems do not address the storage devices of the data storage system directly, but rather, access what appears to the host systems as a plurality of files, objects, logical units, logical devices or logical volumes. Thus, the I/O operations issued by the host may be directed to a particular storage entity, such as a file or logical device. The logical devices may or may not correspond to the actual physical drives. Allowing multiple host systems to access the single data storage system allows the host systems to share data stored therein.

SUMMARY OF THE INVENTION

Various embodiments of the techniques herein may include a method, a system and a computer readable medium for processing management commands comprising: creating a simulated stretched volume in a single data storage system, wherein the simulated stretched volume simulates a stretched volume configured from two or more volumes in two or more data storage systems with the two or more volumes exposed to a host as a same volume having a same first unique identifier over two or more paths from the two or more data storage systems, wherein the simulated stretched volume is configured from a plurality of volumes of the single data storage system and the plurality of volumes are assigned a plurality of unique identifiers associated with the simulated stretched volume, and wherein the plurality of volumes configured as the simulated stretched volume are exposed to the host over a plurality of paths from the single data storage system as the same volume having the same first unique identifier, wherein the single data storage system includes sets of target ports, wherein each of the sets of target ports simulates paths to a different one of the two or more data storage systems; receiving, on a first path of the plurality of paths, a first management command directed to the simulated stretched volume configured as the same volume having the first unique identifier, wherein the first path is from an initiator of the host to a first target port of the single data storage system, wherein the first target port is included in a first set of the plurality of sets of target ports and the first set of target ports simulates paths to a first data storage system of the two or more data storage systems; and performing first processing to service the first management command, wherein the first processing includes the single data storage system simulating the first data storage system servicing the first management command.

In at least one embodiment, the single data storage system may include a management database with a plurality of metadata records for the plurality of volumes, wherein each of the plurality of volumes may be described by metadata of a different one of the plurality of metadata records, and wherein each of the plurality of metadata records associated with a particular one of the plurality of volumes may include a same set of metadata describing the simulated stretched volume and may include one of the plurality of unique identifiers associated with the particular one of the plurality of volumes. A first volume of the plurality of volumes in the single data storage system may represent and simulate a particular one of the two or more volumes, wherein the particular one volume is included in the first data storage system, and wherein the first processing includes servicing the first management command using one of the plurality of metadata records associated with the first volume.
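By way of illustration only, the per-volume metadata records described above may be sketched as follows. The names VolumeRecord, mgmt_db, ID1 and ID2, as well as the example metadata fields, are hypothetical and are not taken from any particular embodiment; the sketch only shows two records that carry the same set of stretched volume metadata while each holds a different unique identifier.

```python
from dataclasses import dataclass


@dataclass
class VolumeRecord:
    # Unique identifier of one particular volume of the single system.
    unique_id: str
    # Metadata describing the simulated stretched volume (same in every record).
    stretched_meta: dict


# Hypothetical metadata shared by all volumes of the simulated stretched volume.
shared = {"name": "vol1", "size_gb": 100}

# Hypothetical management database: one record per volume, keyed by unique ID.
mgmt_db = {
    "ID1": VolumeRecord("ID1", dict(shared)),  # e.g., the regular volume
    "ID2": VolumeRecord("ID2", dict(shared)),  # e.g., the shadow volume
}
```

Each record holds its own copy of the metadata so that a later update to one record does not silently change the other, which mirrors two independent systems each keeping their own record.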

In at least one embodiment, a second set of target ports may be included in the plurality of sets of target ports of the single data storage system, wherein the second set of target ports may simulate paths to a second data storage system of the two or more data storage systems. A second volume of the plurality of volumes in the single data storage system may represent and simulate another one of the two or more volumes, wherein the another one volume may be included in the second data storage system.

In at least one embodiment, the first volume may be a regular volume configured in the single data storage system with the first unique identifier. The second volume may be a shadow volume of the regular volume, and wherein the shadow volume may be configured with a second unique identifier of the plurality of unique identifiers.

In at least one embodiment, the first processing may include using the first unique identifier to access a first set of metadata of a first of the plurality of metadata records associated with the regular volume. Servicing the first management command may include: reading the first set of metadata associated with the first unique identifier; and returning a portion of the first set of metadata in accordance with the first management command. Servicing the first management command may include: updating, in accordance with the first management command, the first set of metadata of the first metadata record associated with the first unique identifier and the regular volume; and simulating replicating the first management command over a connection to the second data storage system. The connection may be configured for a simulation mode that simulates the stretched volume and wherein the connection may be configured from the single data storage system to the single data storage system. Simulating replicating the first management command over the connection to the second data storage system may include: transmitting the first management command over the connection configured for the simulation mode; mapping the first unique identifier to the second unique identifier; and updating, in accordance with the first management command, a second set of metadata of a second of the plurality of metadata records associated with the second unique identifier and the shadow volume.

In at least one embodiment, processing may include: receiving, over a second path of the plurality of paths, a second management command directed to the simulated stretched volume configured as the same volume having the first unique identifier, wherein the second path is from an initiator of the host to a second target port of the single data storage system, wherein the second target port is included in the second set of the plurality of sets of target ports that simulates paths to the second data storage system; and performing second processing to service the second management command, wherein the second processing includes the single data storage system simulating the second data storage system servicing the second management command. The second processing may include: mapping the first unique identifier associated with the simulated stretched volume to the second unique identifier associated with the simulated stretched volume; and using the second unique identifier to access the second set of metadata of the second metadata record associated with the shadow volume. Servicing the second management command may include: reading the second set of metadata of the second metadata record associated with the second identifier and the shadow volume; and returning a portion of the second set of metadata in accordance with the second management command. Servicing the second management command may include: updating, in accordance with the second management command, the second set of metadata of the second metadata record associated with the second identifier and the shadow volume; and simulating replicating the second management command over the connection to the first data storage system.

In at least one embodiment, simulating replicating the second management command over the connection to the first data storage system may include: mapping the second unique identifier to the first unique identifier; transmitting the second management command over the connection configured for the simulation mode, wherein the second management command is directed to the regular volume having the first unique identifier; and updating, in accordance with the second management command, the first set of metadata of the first metadata record associated with the first unique identifier and the regular volume. Processing may include: receiving a first I/O command on the first path from the host to the single data storage system, wherein the first I/O command is directed to the simulated stretched volume configured as the same volume having the first unique identifier; and servicing the first I/O command using the regular volume. Processing may include: receiving a second I/O command on the second path from the host to the single data storage system, wherein the second I/O command is directed to the simulated stretched volume configured as the same volume having the first unique identifier; and servicing the second I/O command using the regular volume.
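The processing summarized above may be sketched, for purposes of illustration only, as follows. All names (REGULAR_ID, SHADOW_ID, PORT_SETS, service_mgmt, service_io, and the port names P1 through P4) are hypothetical, the loopback connection is reduced to a direct function call, and the metadata is a single example field. The sketch assumes the host always addresses the volume by the first unique identifier, that the set containing the receiving target port selects which metadata record is used, and that I/O on either path is serviced from the regular volume.

```python
# Hypothetical identifiers of the regular volume and its shadow volume.
REGULAR_ID, SHADOW_ID = "WWN-1", "WWN-2"
ID_MAP = {REGULAR_ID: SHADOW_ID, SHADOW_ID: REGULAR_ID}

# Hypothetical target ports; each set simulates paths to a different system.
PORT_SETS = {"P1": "first", "P2": "first", "P3": "second", "P4": "second"}

# One metadata record per volume, each holding the same stretched volume metadata.
metadata = {REGULAR_ID: {"name": "vol"}, SHADOW_ID: {"name": "vol"}}

# User data is kept only for the regular volume: LBA -> bytes.
blocks = {}


def local_id(target_port, exposed_id):
    """Select the record ID: the first port set uses the exposed ID as is,
    the second port set maps it to the shadow volume's ID."""
    if PORT_SETS[target_port] == "first":
        return exposed_id
    return ID_MAP[exposed_id]


def simulate_replication(source_id, updates):
    """Stand-in for the simulation-mode connection from the system to itself:
    map the source ID to its peer ID and apply the same update there."""
    metadata[ID_MAP[source_id]].update(updates)


def service_mgmt(target_port, exposed_id, op, updates=None):
    """Service a read or update management command on the selected record."""
    vid = local_id(target_port, exposed_id)
    if op == "read":
        return dict(metadata[vid])
    metadata[vid].update(updates)
    simulate_replication(vid, updates)
    return None


def service_io(target_port, exposed_id, op, lba, data=None):
    """Service I/O from the regular volume regardless of the receiving port."""
    if op == "write":
        blocks[lba] = data
        return None
    return blocks.get(lba)
```

For example, an update received on the hypothetical port P3 is applied first to the shadow volume's record and then, through the simulated replication, to the regular volume's record, whereas data written through P1 can be read back through P4 because both paths use the regular volume.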

BRIEF DESCRIPTION OF THE DRAWINGS

Features and advantages of the present invention will become more apparent from the following detailed description of exemplary embodiments thereof taken in conjunction with the accompanying drawings in which:

FIG. 1 is an example of components that may be included in a system in accordance with the techniques described herein.

FIG. 2 is an example illustrating the I/O path or data path in connection with processing data in an embodiment in accordance with the techniques herein.

FIG. 3 is an example of systems that may be used in performing data replication.

FIG. 4 is an example illustrating an active-passive replication arrangement.

FIG. 5 is an example illustrating an active-active arrangement with a stretched volume.

FIG. 6 is an example illustrating path states for paths between a host and a data storage system.

FIGS. 7A and 7B are examples illustrating path states for paths between multiple data storage systems and multiple hosts in a metro cluster configuration with a stretched volume.

FIG. 7C is an example of a metro cluster configuration including three data storage systems.

FIG. 8 is an example illustrating components and data flow in connection with processing commands with a simulated stretched volume in at least one embodiment in accordance with the techniques herein.

FIGS. 9A and 9B are examples of volume metadata that may be used in connection with a simulated stretched volume in at least one embodiment in accordance with the techniques herein.

FIGS. 10A, 10B and 10C are flowcharts of processing steps that may be performed in at least one embodiment in accordance with the techniques herein.

DETAILED DESCRIPTION OF EMBODIMENT(S)

Referring to the FIG. 1, shown is an example of an embodiment of a system 10 that may be used in connection with performing the techniques described herein. The system 10 includes a data storage system 12 connected to the host systems (also sometimes referred to as hosts) 14 a-14 n through the communication medium 18. In this embodiment of the system 10, the n hosts 14 a-14 n may access the data storage system 12, for example, in performing input/output (I/O) operations or data requests. The communication medium 18 may be any one or more of a variety of networks or other types of communication connections as known to those skilled in the art. The communication medium 18 may be a network connection, bus, and/or other type of data link, such as a hardwire or other connections known in the art. For example, the communication medium 18 may be the Internet, an intranet, network (including a Storage Area Network (SAN)) or other wireless or other hardwired connection(s) by which the host systems 14 a-14 n may access and communicate with the data storage system 12, and may also communicate with other components included in the system 10.

Each of the host systems 14 a-14 n and the data storage system 12 included in the system 10 may be connected to the communication medium 18 by any one of a variety of connections as may be provided and supported in accordance with the type of communication medium 18. The processors included in the host systems 14 a-14 n and data storage system 12 may be any one of a variety of proprietary or commercially available single or multi-processor systems, such as an Intel-based processor, or other type of commercially available processor able to support traffic in accordance with each particular embodiment and application.

It should be noted that the particular examples of the hardware and software that may be included in the data storage system 12 are described herein in more detail, and may vary with each particular embodiment. Each of the hosts 14 a-14 n and the data storage system 12 may all be located at the same physical site, or, alternatively, may also be located in different physical locations. The communication medium 18 used for communication between the host systems 14 a-14 n and the data storage system 12 of the system 10 may use a variety of different communication protocols such as block-based protocols (e.g., SCSI (Small Computer System Interface), Fibre Channel (FC), iSCSI), file system-based protocols (e.g., NFS (Network File System)), and the like. Some or all of the connections by which the hosts 14 a-14 n and the data storage system 12 may be connected to the communication medium 18 may pass through other communication devices, such as switching equipment, a phone line, a repeater, a multiplexer or even a satellite.

Each of the host systems 14 a-14 n may perform data operations. In the embodiment of the FIG. 1, any one of the host computers 14 a-14 n may issue a data request to the data storage system 12 to perform a data operation. For example, an application executing on one of the host computers 14 a-14 n may perform a read or write operation resulting in one or more data requests to the data storage system 12.

It should be noted that although the element 12 is illustrated as a single data storage system, such as a single data storage array, the element 12 may also represent, for example, multiple data storage arrays alone, or in combination with, other data storage devices, systems, appliances, and/or components having suitable connectivity, such as in a SAN (storage area network) or LAN (local area network), in an embodiment using the techniques herein. It should also be noted that an embodiment may include data storage arrays or other components from one or more vendors. In subsequent examples illustrating the techniques herein, reference may be made to a single data storage array by a vendor. However, as will be appreciated by those skilled in the art, the techniques herein are applicable for use with other data storage arrays by other vendors and with other components than as described herein for purposes of example.

The data storage system 12 may be a data storage appliance or a data storage array including a plurality of data storage devices (PDs) 16 a-16 n. The data storage devices 16 a-16 n may include one or more types of data storage devices such as, for example, one or more rotating disk drives and/or one or more solid state drives (SSDs). An SSD is a data storage device that uses solid-state memory to store persistent data. SSDs may refer to solid state electronics devices as distinguished from electromechanical devices, such as hard drives, having moving parts. Flash devices or flash memory-based SSDs are one type of SSD that contains no moving mechanical parts. The flash devices may be constructed using nonvolatile semiconductor NAND flash memory. The flash devices may include, for example, one or more SLC (single level cell) devices and/or MLC (multi level cell) devices.

The data storage array may also include different types of controllers, adapters or directors, such as an HA 21 (host adapter), RA 40 (remote adapter), and/or device interface(s) 23. Each of the adapters (sometimes also known as controllers, directors or interface components) may be implemented using hardware including a processor with a local memory with code stored thereon for execution in connection with performing different operations. The HAs may be used to manage communications and data operations between one or more host systems and the global memory (GM). In an embodiment, the HA may be a Fibre Channel Adapter (FA) or other adapter which facilitates host communication. The HA 21 may be characterized as a front end component of the data storage system which receives a request from one of the hosts 14 a-n. The data storage array may include one or more RAs that may be used, for example, to facilitate communications between data storage arrays. The data storage array may also include one or more device interfaces 23 for facilitating data transfers to/from the data storage devices 16 a-16 n. The data storage device interfaces 23 may include device interface modules, for example, one or more disk adapters (DAs) (e.g., disk controllers) for interfacing with the flash drives or other physical storage devices (e.g., PDs 16 a-n). The DAs may also be characterized as back end components of the data storage system which interface with the physical data storage devices.

One or more internal logical communication paths may exist between the device interfaces 23, the RAs 40, the HAs 21, and the memory 26. An embodiment, for example, may use one or more internal busses and/or communication modules. For example, the global memory portion 25 b may be used to facilitate data transfers and other communications between the device interfaces, the HAs and/or the RAs in a data storage array. In one embodiment, the device interfaces 23 may perform data operations using a system cache that may be included in the global memory 25 b, for example, when communicating with other device interfaces and other components of the data storage array. The other portion 25 a is that portion of the memory that may be used in connection with other designations that may vary in accordance with each embodiment.

The particular data storage system as described in this embodiment, or a particular device thereof, such as a disk or particular aspects of a flash device, should not be construed as a limitation. Other types of commercially available data storage systems, as well as processors and hardware controlling access to these particular devices, may also be included in an embodiment.

The host systems 14 a-14 n provide data and access control information through channels to the storage systems 12, and the storage systems 12 may also provide data to the host systems 14 a-n also through the channels. The host systems 14 a-n do not address the drives or devices 16 a-16 n of the storage systems directly, but rather access to data may be provided to one or more host systems from what the host systems view as a plurality of logical devices or logical volumes (LVs), which may also be referred to herein as logical units (e.g., LUNs). A logical unit (LUN) may be characterized as a disk array or data storage system reference to an amount of storage space that has been formatted and allocated for use to one or more hosts. A logical unit may have a logical unit number that is an I/O address for the logical unit. As used herein, a LUN or LUNs may refer to the different logical units of storage which may be referenced by such logical unit numbers. The LUNs may or may not correspond to the actual or physical disk drives or more generally physical storage devices. For example, one or more LUNs may reside on a single physical disk drive, data of a single LUN may reside on multiple different physical devices, and the like. Data in a single data storage system, such as a single data storage array, may be accessed by multiple hosts allowing the hosts to share the data residing therein. The HAs may be used in connection with communications between a data storage array and a host system. The RAs may be used in facilitating communications between two data storage arrays. The DAs may include one or more types of device interface used in connection with facilitating data transfers to/from the associated disk drive(s) and LUN(s) residing thereon. For example, such device interfaces may include a device interface used in connection with facilitating data transfers to/from the associated flash devices and LUN(s) residing thereon. It should be noted that an embodiment may use the same or a different device interface for one or more different types of devices than as described herein.

In an embodiment in accordance with the techniques herein, the data storage system as described may be characterized as having one or more logical mapping layers in which a logical device of the data storage system is exposed to the host whereby the logical device is mapped by such mapping layers of the data storage system to one or more physical devices. Additionally, the host may also have one or more additional mapping layers so that, for example, a host side logical device or volume is mapped to one or more data storage system logical devices as presented to the host.

It should be noted that although examples of the techniques herein may be made with respect to a physical data storage system and its physical components (e.g., physical hardware for each HA, DA, HA port and the like), the techniques herein may be performed in a physical data storage system including one or more emulated or virtualized components (e.g., emulated or virtualized ports, emulated or virtualized DAs or HAs), and also a virtualized or emulated data storage system including virtualized or emulated components.

Also shown in the FIG. 1 is a management system 22 a that may be used to manage and monitor the data storage system 12. In one embodiment, the management system 22 a may be a computer system which includes data storage system management software or application such as may execute in a web browser. A data storage system manager may, for example, view information about a current data storage configuration such as LUNs, storage pools, and the like, on a user interface (UI) in a display device of the management system 22 a. Alternatively, and more generally, the management software may execute on any suitable processor in any suitable system. For example, the data storage system management software may execute on a processor of the data storage system 12.

Information regarding the data storage system configuration may be stored in any suitable data container, such as a database. The data storage system configuration information stored in the database may generally describe the various physical and logical entities in the current data storage system configuration. The data storage system configuration information may describe, for example, the LUNs configured in the system, properties and status information of the configured LUNs (e.g., LUN storage capacity, unused or available storage capacity of a LUN, consumed or used capacity of a LUN), configured RAID groups, properties and status information of the configured RAID groups (e.g., the RAID level of a RAID group, the particular PDs that are members of the configured RAID group), the PDs in the system, properties and status information about the PDs in the system, local replication configurations and details of existing local replicas (e.g., a schedule or other trigger conditions of when a snapshot is taken of one or more LUNs, identify information regarding existing snapshots for a particular LUN), remote replication configurations (e.g., for a particular LUN on the local data storage system, identify the LUN's corresponding remote counterpart LUN and the remote data storage system on which the remote LUN is located), data storage system performance information such as regarding various storage objects and other entities in the system, and the like.

Consistent with other discussion herein, management commands issued over the control or data path may include commands that query or read selected portions of the data storage system configuration, such as information regarding the properties or attributes of one or more LUNs. The management commands may also include commands that write, update, or modify the data storage system configuration, such as, for example, to create or provision a new LUN (e.g., which may result in modifying one or more database tables such as to add information for the new LUN), to modify an existing replication schedule or configuration (e.g., which may result in updating existing information in one or more database tables for the current replication schedule or configuration), to delete a LUN (e.g., which may include deleting the LUN from a table of defined LUNs and may also include modifying one or more other database tables to delete any existing snapshots of the LUN being deleted), and the like.
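As a minimal sketch of the foregoing, and not as a description of any actual configuration database, management commands that read or modify the configuration may be modeled with an in-memory SQLite database. The table names (luns, snapshots), the column names and the function names are assumptions made only for this example.

```python
import sqlite3

# Hypothetical configuration database with a table of defined LUNs and a
# table of existing snapshots of those LUNs.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE luns (name TEXT PRIMARY KEY, capacity_gb INTEGER)")
db.execute("CREATE TABLE snapshots (lun TEXT, taken_at TEXT)")


def create_lun(name, capacity_gb):
    """Modifying command: add information for a newly provisioned LUN."""
    db.execute("INSERT INTO luns VALUES (?, ?)", (name, capacity_gb))


def query_lun(name):
    """Read-only command: return the LUN's properties, or None if undefined."""
    cur = db.execute(
        "SELECT name, capacity_gb FROM luns WHERE name = ?", (name,)
    )
    return cur.fetchone()


def delete_lun(name):
    """Modifying command: delete the LUN's snapshots, then the LUN itself."""
    db.execute("DELETE FROM snapshots WHERE lun = ?", (name,))
    db.execute("DELETE FROM luns WHERE name = ?", (name,))
```

In this sketch, deleting a LUN touches both tables, which corresponds to the example above in which deleting a LUN may also include deleting any existing snapshots of that LUN.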

It should be noted that each of the different controllers or adapters, such as each HA, DA, RA, and the like, may be implemented as a hardware component including, for example, one or more processors, one or more forms of memory, and the like. Code may be stored in one or more of the memories of the component for performing processing.

The device interface, such as a DA, performs I/O operations on a physical device or drive 16 a-16 n. In the following description, data residing on a LUN may be accessed by the device interface following a data request in connection with I/O operations. For example, a host may issue an I/O operation which is received by the HA 21. The I/O operation may identify a target location from which data is read from, or written to, depending on whether the I/O operation is, respectively, a read or a write operation request. The target location of the received I/O operation may be expressed in terms of a LUN and logical address or offset location (e.g., LBA or logical block address) on the LUN. Processing may be performed on the data storage system to further map the target location of the received I/O operation, expressed in terms of a LUN and logical address or offset location on the LUN, to its corresponding physical storage device (PD) and location on the PD. The DA which services the particular PD may further perform processing to either read data from, or write data to, the corresponding physical device location for the I/O operation.
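The mapping of a target location, expressed as a LUN and LBA, to a PD and a location on the PD may be illustrated with the following sketch. The extent-based layout, the LUN_MAP table, the PD names and the map_target function are all assumptions chosen for the example; the mapping layers of an actual data storage system may be organized differently.

```python
from bisect import bisect_right

# Hypothetical layout: each LUN is a sorted list of extents, where an extent
# is (first LBA on the LUN, length in blocks, PD name, first block on the PD).
LUN_MAP = {
    "LUN_A": [(0, 1000, "PD_16a", 5000), (1000, 1000, "PD_16b", 0)],
}


def map_target(lun, lba):
    """Map (LUN, LBA) to (PD, block on the PD) by locating the extent
    that contains the LBA."""
    extents = LUN_MAP[lun]
    starts = [extent[0] for extent in extents]
    index = bisect_right(starts, lba) - 1
    if index < 0:
        raise ValueError("LBA precedes the first extent")
    start, length, pd, pd_start = extents[index]
    if lba >= start + length:
        raise ValueError("LBA beyond the provisioned extents")
    return pd, pd_start + (lba - start)
```

With this example layout, a single LUN has data residing on two different PDs, consistent with the statement elsewhere herein that data of a single LUN may reside on multiple different physical devices.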

It should be noted that an embodiment of a data storage system may include components having different names from that described herein but which perform functions similar to components as described herein. Additionally, components within a single data storage system, and also between data storage systems, may communicate using any suitable technique that may differ from that as described herein for exemplary purposes. For example, element 12 of the FIG. 1 may be a data storage system, such as a data storage array, that includes multiple storage processors (SPs). Each of the SPs 27 may be a CPU including one or more “cores” or processors and each may have its own memory used for communication between the different front end and back end components rather than utilize a global memory accessible to all storage processors. In such embodiments, the memory 26 may represent memory of each such storage processor.

Generally, the techniques herein may be used in connection with any suitable storage system, appliance, device, and the like, in which data is stored. For example, an embodiment may implement the techniques herein using a midrange data storage system, such as a Dell EMC Unity® data storage system or a Dell EMC PowerStore® data storage system, as well as a high end or enterprise data storage system, such as a Dell EMC™ PowerMAX™ data storage system.

The data path or I/O path may be characterized as the path or flow of I/O data through a system. For example, the data or I/O path may be the logical flow through hardware and software components or layers in connection with a user, such as an application executing on a host (e.g., more generally, a data storage client) issuing I/O commands (e.g., SCSI-based commands, and/or file-based commands) that read and/or write user data to a data storage system, and also receive a response (possibly including requested data) in connection with such I/O commands.

The control path, also sometimes referred to as the management path, may be characterized as the path or flow of data management or control commands through a system. For example, the control or management path may be the logical flow through hardware and software components or layers in connection with issuing data storage management commands to and/or from a data storage system, and also receiving responses (possibly including requested data) to such control or management commands. For example, with reference to the FIG. 1, the control commands may be issued from data storage management software executing on the management system 22 a to the data storage system 12. Such commands may be, for example, to establish or modify data services, provision storage, perform user account management, and the like. Consistent with other discussion herein, the management commands may result in processing that includes reading and/or modifying information in the database storing data storage system configuration information. For example, management commands that read and/or modify the data storage system configuration information in the database may be issued over the control path to provision storage for LUNs, create a snapshot, define conditions of when to create another snapshot, define or establish local and/or remote replication services, define or modify a schedule for snapshot or other data replication services, define a RAID group, obtain data storage management and configuration information for display in a graphical user interface (GUI) of a data storage management program or application, generally modify one or more aspects of a data storage system configuration, list properties and status information regarding LUNs or other storage objects (e.g., physical and/or logical entities in the data storage system), and the like.

The data path and control path define two sets of different logical flow paths. In at least some of the data storage system configurations, at least part of the hardware and network connections used for each of the data path and control path may differ. For example, although both control path and data path may generally use a network for communications, some of the hardware and software used may differ. For example, with reference to the FIG. 1, a data storage system may have a separate physical connection 29 from a management system 22 a to the data storage system 12 being managed whereby control commands may be issued over such a physical connection 29. However, it may be that user I/O commands are never issued over such a physical connection 29 provided solely for purposes of connecting the management system to the data storage system. In any case, the data path and control path define two separate logical flow paths.

With reference to the FIG. 2, shown is an example 100 illustrating components that may be included in the data path in at least one existing data storage system in accordance with the techniques herein. The example 100 includes two processing nodes A 102 a and B 102 b and the associated software stacks 104, 106 of the data path, where I/O requests may be received by either processing node 102 a or 102 b. In the example 100, the data path 104 of processing node A 102 a includes: the frontend (FE) component 104 a (e.g., an FA or front end adapter) that translates the protocol-specific request into a storage system-specific request; a system cache layer 104 b where data is temporarily stored; an inline processing layer 105 a; and a backend (BE) component 104 c that facilitates movement of the data between the system cache and non-volatile physical storage (e.g., back end physical non-volatile storage devices or PDs accessed by BE components such as DAs as described herein). During movement of data in and out of the system cache layer 104 b (e.g., such as in connection with reading data from, and writing data to, physical storage 110 a, 110 b), inline processing may be performed by layer 105 a. Such inline processing operations of 105 a may be optionally performed and may include any one or more data processing operations in connection with data that is flushed from system cache layer 104 b to the back-end non-volatile physical storage 110 a, 110 b, as well as when retrieving data from the back-end non-volatile physical storage 110 a, 110 b to be stored in the system cache layer 104 b. In at least one embodiment, the inline processing may include, for example, performing one or more data reduction operations such as data deduplication or data compression. The inline processing may include performing any suitable or desirable data processing operations as part of the I/O or data path.

In a manner similar to that as described for data path 104, the data path 106 for processing node B 102 b has its own FE component 106 a, system cache layer 106 b, inline processing layer 105 b, and BE component 106 c that are respectively similar to the components 104 a, 104 b, 105 a and 104 c. The elements 110 a, 110 b denote the non-volatile BE physical storage provisioned from PDs for the LUNs, whereby an I/O may be directed to a location or logical address of a LUN and where data may be read from, or written to, the logical address. The LUNs 110 a, 110 b are examples of storage objects representing logical storage entities included in an existing data storage system configuration. Since, in this example, writes directed to the LUNs 110 a, 110 b may be received for processing by either of the nodes 102 a and 102 b, the example 100 illustrates what may also be referred to as an active-active configuration.

In connection with a write operation as may be received from a host and processed by the processing node A 102 a, the write data may be written to the system cache 104 b, marked as write pending (WP) denoting it needs to be written to the physical storage 110 a, 110 b and, at a later point in time, the write data may be destaged or flushed from the system cache to the physical storage 110 a, 110 b by the BE component 104 c. The write request may be considered complete once the write data has been stored in the system cache whereby an acknowledgement regarding the completion may be returned to the host (e.g., by the component 104 a). At various points in time, the WP data stored in the system cache is flushed or written out to the physical storage 110 a, 110 b.
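As a non-limiting illustration only, the write flow described above may be sketched as follows. The class name `NodeCache`, its attributes, and the string `"ack"` are hypothetical names chosen for this sketch and do not denote any actual implementation; the backend is modeled as a simple dictionary.

```python
class NodeCache:
    """Toy model of one node's system cache (illustrative names only)."""

    def __init__(self):
        self.cache = {}             # logical address -> cached data
        self.write_pending = set()  # addresses marked WP (not yet on backend)
        self.backend = {}           # stands in for BE physical storage

    def write(self, address, data):
        # Store the write data in the cache and mark it write pending.
        self.cache[address] = data
        self.write_pending.add(address)
        # The write is considered complete once the data is cached.
        return "ack"

    def flush(self):
        # At a later point in time, destage all WP data to the backend.
        flushed = len(self.write_pending)
        for address in sorted(self.write_pending):
            self.backend[address] = self.cache[address]
        self.write_pending.clear()
        return flushed
```

In this sketch, the acknowledgement is returned before `flush` runs, which mirrors the point above that completion is reported once the data is in the system cache rather than once it reaches physical storage.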

In connection with the inline processing layer 105 a, prior to storing the original data on the physical storage 110 a, 110 b, one or more data reduction operations may be performed. For example, the inline processing may include performing data compression processing, data deduplication processing, and the like, that may convert the original data (as stored in the system cache prior to inline processing) to a resulting representation or form which is then written to the physical storage 110 a, 110 b.

In connection with a read operation to read a block of data, a determination is made as to whether the requested read data block is stored in its original form (in system cache 104 b or on physical storage 110 a, 110 b), or whether the requested read data block is stored in a different modified form or representation. If the requested read data block (which is stored in its original form) is in the system cache, the read data block is retrieved from the system cache 104 b and returned to the host. Otherwise, if the requested read data block is not in the system cache 104 b but is stored on the physical storage 110 a, 110 b in its original form, the requested data block is read by the BE component 104 c from the backend storage 110 a, 110 b, stored in the system cache and then returned to the host.

If the requested read data block is not stored in its original form, the original form of the read data block is recreated and stored in the system cache in its original form so that it can be returned to the host. Thus, requested read data stored on physical storage 110 a, 110 b may be stored in a modified form where processing is performed by 105 a to restore or convert the modified form of the data to its original data form prior to returning the requested read data to the host.
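As a non-limiting illustration only, the read flow described above may be sketched as follows. The function name `read_block`, the `(form, payload)` tuple layout of the backend, and the use of zlib compression as the modified form are all assumptions made for this sketch; the inline processing may use any suitable data reduction technique.

```python
import zlib


def read_block(address, cache, backend):
    """Return a block in its original form, populating the cache on a miss.

    `backend` maps an address to (form, payload), where form is either
    "original" or "compressed" (an assumed layout for this sketch).
    """
    if address in cache:
        # Cache hit: the cached copy is already in its original form.
        return cache[address]
    form, payload = backend[address]
    if form == "compressed":
        # Inline processing restores the original form of the data.
        data = zlib.decompress(payload)
    else:
        data = payload
    # Store the original form in the cache, then return it to the host.
    cache[address] = data
    return data
```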

Also illustrated in FIG. 2 is an internal network interconnect 120 between the nodes 102 a, 102 b. In at least one embodiment, the interconnect 120 may be used for internode communication between the nodes 102 a, 102 b.

In connection with at least one embodiment in accordance with the techniques herein, each processor or CPU may include its own private dedicated CPU cache (also sometimes referred to as processor cache) that is not shared with other processors. In at least one embodiment, the CPU cache, as in general with cache memory, may be a form of fast memory (relatively faster than main memory which may be a form of RAM). In at least one embodiment, the CPU or processor cache is on the same die or chip as the processor and typically, like cache memory in general, is far more expensive to produce than normal RAM such as may be used as main memory. The processor cache may be substantially faster than the system RAM such as used as main memory and contains information that the processor will be immediately and repeatedly accessing. The faster memory of the CPU cache may, for example, run at a refresh rate that's closer to the CPU's clock speed, which minimizes wasted cycles. In at least one embodiment, there may be two or more levels (e.g., L1, L2 and L3) of cache. The CPU or processor cache may include at least an L1 level cache that is the local or private CPU cache dedicated for use only by that particular processor. The two or more levels of cache in a system may also include at least one other level of cache (LLC or lower level cache) that is shared among the different CPUs. The L1 level cache serving as the dedicated CPU cache of a processor may be the closest of all cache levels (e.g., L1-L3) to the processor which stores copies of the data from frequently used main memory locations. Thus, the system cache as described herein may include the CPU cache (e.g., the L1 level cache or dedicated private CPU/processor cache) as well as other cache levels (e.g., the LLC) as described herein. Portions of the LLC may be used, for example, to initially cache write data which is then flushed to the backend physical storage such as BE PDs providing non-volatile storage. For example, in at least one embodiment, a RAM based memory may be one of the caching layers used to cache the write data that is then flushed to the backend physical storage. When the processor performs processing, such as in connection with the inline processing 105 a, 105 b as noted above, data may be loaded from the main memory and/or other lower cache levels into its CPU cache.

In at least one embodiment, the data storage system may be configured to include one or more pairs of nodes, where each pair of nodes may be generally as described and represented as the nodes 102 a-b in the FIG. 2. For example, a data storage system may be configured to include at least one pair of nodes and at most a maximum number of node pairs, such as for example, a maximum of 4 node pairs. The maximum number of node pairs may vary with embodiment. In at least one embodiment, a base enclosure may include the minimum single pair of nodes and up to a specified maximum number of PDs. In some embodiments, a single base enclosure may be scaled up to have additional BE non-volatile storage using one or more expansion enclosures, where each expansion enclosure may include a number of additional PDs. Further, in some embodiments, multiple base enclosures may be grouped together in a load-balancing cluster to provide up to the maximum number of node pairs. Consistent with other discussion herein, each node may include one or more processors and memory. In at least one embodiment, each node may include two multi-core processors with each processor of the node having a core count of between 8 and 28 cores. In at least one embodiment, the PDs may all be non-volatile SSDs, such as flash-based storage devices and storage class memory (SCM) devices. It should be noted that the two nodes configured as a pair may also sometimes be referred to as peer nodes. For example, the node A 102 a is the peer node of the node B 102 b, and the node B 102 b is the peer node of the node A 102 a.

In at least one embodiment, the data storage system may be configured to provide both block and file storage services with a system software stack that includes an operating system running directly on the processors of the nodes of the system.

In at least one embodiment, the data storage system may be configured to provide block-only storage services (e.g., no file storage services). A hypervisor may be installed on each of the nodes to provide a virtualized environment of virtual machines (VMs). The system software stack may execute in the virtualized environment deployed on the hypervisor. The system software stack (sometimes referred to as the software stack or stack) may include an operating system running in the context of a VM of the virtualized environment. Additional software components may be included in the system software stack and may also execute in the context of a VM of the virtualized environment.

In at least one embodiment, each pair of nodes may be configured in an active-active configuration as described elsewhere herein, such as in connection with FIG. 2, where each node of the pair has access to the same PDs providing BE storage for high availability. With the active-active configuration of each pair of nodes, both nodes of the pair process I/O operations or commands and also transfer data to and from the BE PDs attached to the pair. In at least one embodiment, BE PDs attached to one pair of nodes may not be shared with other pairs of nodes. A host may access data stored on a BE PD through the node pair associated with or attached to the PD.

In at least one embodiment, each pair of nodes provides a dual node architecture where both nodes of the pair may be identical in terms of hardware and software for redundancy and high availability. Consistent with other discussion herein, each node of a pair may perform processing of the different components (e.g., FA, DA, and the like) in the data path or I/O path as well as the control or management path. Thus, in such an embodiment, different components, such as the FA, DA and the like of FIG. 1, may denote logical or functional components implemented by code executing on the one or more processors of each node. Each node of the pair may include its own local resources (i.e., resources used only by the node) such as local processor(s), local memory, and the like.

Data replication is one of the data services that may be performed on a data storage system in an embodiment in accordance with the techniques herein. In at least one data storage system, remote replication is one technique that may be used in connection with providing for disaster recovery (DR) of an application's data set. The application, such as executing on a host, may write to a production or primary data set of one or more LUNs on a primary data storage system. Remote replication may be used to remotely replicate the primary data set of LUNs to a second remote data storage system. In the event that the primary data set on the primary data storage system is destroyed or more generally unavailable for use by the application, the replicated copy of the data set on the second remote data storage system may be utilized by the host. For example, the host may directly access the copy of the data set on the second remote system. As an alternative, the primary data set of the primary data storage system may be restored using the replicated copy of the data set, whereby the host may subsequently access the restored data set on the primary data storage system. A remote data replication service or facility may provide for automatically replicating data of the primary data set on a first data storage system to a second remote data storage system in an ongoing manner in accordance with a particular replication mode, such as a synchronous mode described elsewhere herein.

Referring to FIG. 3, shown is an example 2101 illustrating remote data replication. It should be noted that the embodiment illustrated in FIG. 3 presents a simplified view of some of the components illustrated in FIGS. 1 and 2, for example, including only some detail of the data storage systems 12 for the sake of illustration.

Included in the example 2101 are the data storage systems 2102 and 2104 and the hosts 2110 a, 2110 b and 2110 c. The data storage systems 2102, 2104 may be remotely connected and communicate over the network 2122, such as the Internet or other private network, and facilitate communications with the components connected thereto. The hosts 2110 a, 2110 b and 2110 c may perform operations to the data storage system 2102 over the connection 2108 a. The hosts 2110 a, 2110 b and 2110 c may be connected to the data storage system 2102 through the connection 2108 a which may be, for example, a network or other type of communication connection.

The data storage systems 2102 and 2104 may include one or more devices. In this example, the data storage system 2102 includes the storage device R1 2124, and the data storage system 2104 includes the storage device R2 2126. Both of the data storage systems 2102, 2104 may include one or more other logical and/or physical devices. The data storage system 2102 may be characterized as local with respect to the hosts 2110 a, 2110 b and 2110 c. The data storage system 2104 may be characterized as remote with respect to the hosts 2110 a, 2110 b and 2110 c. The R1 and R2 devices may be configured as LUNs.

The host 2110 a may issue a command, such as to write data to the device R1 of the data storage system 2102. In some instances, it may be desirable to copy data from the storage device R1 to another second storage device, such as R2, provided in a different location so that if a disaster occurs that renders R1 inoperable, the host (or another host) may resume operation using the data of R2. With remote replication, a user may denote a first storage device, such as R1, as a primary storage device and a second storage device, such as R2, as a secondary storage device. In this example, the host 2110 a interacts directly with the device R1 of the data storage system 2102, and any data changes made are automatically provided to the R2 device of the data storage system 2104 by a remote replication facility (RRF). In operation, the host 2110 a may read and write data using the R1 volume in 2102, and the RRF may handle the automatic copying and updating of data from R1 to R2 in the data storage system 2104. Communications between the storage systems 2102 and 2104 may be made over connections 2108 b, 2108 c to the network 2122.

A RRF may be configured to operate in one or more different supported replication modes. For example, such modes may include synchronous mode and asynchronous mode, and possibly other supported modes. When operating in the synchronous mode, the host does not consider a write I/O operation to be complete until the write I/O has been completed on both the first and second data storage systems. Thus, in the synchronous mode, the first or source storage system will not provide an indication to the host that the write operation is committed or complete until the first storage system receives an acknowledgement from the second data storage system regarding completion or commitment of the write by the second data storage system. In contrast, in connection with the asynchronous mode, the host receives an acknowledgement from the first data storage system as soon as the information is committed to the first data storage system without waiting for an acknowledgement from the second data storage system.

With synchronous mode remote data replication, a host 2110 a may issue a write to the R1 device 2124. The primary or R1 data storage system 2102 may store the write data in its cache at a cache location and mark the cache location as including write pending (WP) data as mentioned elsewhere herein. The RRF operating in the synchronous mode may propagate the write data across an established connection or link (more generally referred to as the remote replication link or link) such as over 2108 b, 2122, and 2108 c, to the secondary or R2 data storage system 2104 where the write data may be stored in the cache of the system 2104 at a cache location that is marked as WP. Once the write data is stored in the cache of the system 2104 as described, the R2 data storage system 2104 may return an acknowledgement to the R1 data storage system 2102 that it has received the write data. Responsive to receiving this acknowledgement from the R2 data storage system 2104, the R1 data storage system 2102 may return an acknowledgement to the host 2110 a that the write has been received and completed. Thus, generally, R1 device 2124 and R2 device 2126 may be logical devices, such as LUNs, configured as mirrors of one another. R1 and R2 devices may be, for example, fully provisioned LUNs, such as thick LUNs, or may be LUNs that are thin or virtually provisioned logical devices.
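As a non-limiting illustration only, the difference in acknowledgement ordering between the synchronous and asynchronous modes described above may be sketched as follows. The function name `replicated_write`, the event strings, and the modeling of each system's cache as a dictionary are hypothetical choices for this sketch; the replication link and any failure handling are omitted.

```python
def replicated_write(r1_cache, r2_cache, address, data, mode="synchronous"):
    """Apply one host write and return the ordered list of events."""
    events = []
    # The R1 system always caches the write first.
    r1_cache[address] = data
    events.append("R1 cached")
    if mode == "synchronous":
        # R1 waits for R2 to cache and acknowledge before acking the host.
        r2_cache[address] = data
        events.append("R2 cached")
        events.append("R2 ack to R1")
        events.append("R1 ack to host")
    elif mode == "asynchronous":
        # R1 acks the host as soon as the write is committed locally.
        events.append("R1 ack to host")
        r2_cache[address] = data
        events.append("R2 cached")
    else:
        raise ValueError(f"unsupported replication mode: {mode}")
    return events
```

In the synchronous branch the host acknowledgement is the last event, whereas in the asynchronous branch it precedes the copy to R2.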

With reference to FIG. 4, shown is a further simplified illustration of components that may be used in connection with remote replication. The example 2400 is a simplified illustration of components as described in connection with FIG. 2. The element 2402 generally represents the replication link used in connection with sending write data from the primary R1 data storage system 2102 to the secondary R2 data storage system 2104. The link 2402, more generally, may also be used in connection with other information and communications exchanged between the systems 2102 and 2104 for replication. As mentioned above, when operating in synchronous replication mode, host 2110 a issues a write, or more generally, all I/Os including reads and writes, over a path to only the primary R1 data storage system 2102. The host 2110 a does not issue I/Os directly to the R2 data storage system 2104. The configuration of FIG. 4 may also be referred to herein as an active-passive configuration such as may be used with synchronous replication and other supported replication modes where the host 2110 a has an active connection or path 2108 a over which all I/Os are issued to only the R1 data storage system. The host 2110 a may have a passive connection or path 2404 to the R2 data storage system 2104.

In the configuration of 2400, the R1 device 2124 and R2 device 2126 may be configured and identified as the same LUN, such as LUN A, to the host 2110 a. Thus, the host 2110 a may view 2108 a and 2404 as two paths to the same LUN A, where path 2108 a is active (over which I/Os may be issued to LUN A) and where path 2404 is passive (over which no I/Os to the LUN A may be issued). For example, the devices 2124 and 2126 may be configured to have the same logical device identifier such as the same world wide name (WWN) or other identifier as well as having other attributes or properties that are the same. Should the connection 2108 a and/or the R1 data storage system 2102 experience a failure or disaster whereby access to R1 2124 configured as LUN A is unavailable, processing may be performed on the host 2110 a to modify the state of path 2404 to active and commence issuing I/Os to the R2 device configured as LUN A. In this manner, the R2 device 2126 configured as LUN A may be used as a backup accessible to the host 2110 a for servicing I/Os upon failure of the R1 device 2124 configured as LUN A.
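As a non-limiting illustration only, the host-side failover described above may be sketched as follows. The function names `active_path` and `fail_over`, the state strings, and the dictionary keyed by path label are hypothetical; actual host processing for changing path state is not specified here.

```python
def active_path(paths):
    """Return the label of the first path in the "active" state, if any."""
    for label, state in paths.items():
        if state == "active":
            return label
    return None


def fail_over(paths, failed_label):
    """Return a new path table after the given path has failed.

    If no active path remains, the first passive path is promoted to active.
    """
    updated = dict(paths)
    updated[failed_label] = "failed"
    if active_path(updated) is None:
        for label, state in updated.items():
            if state == "passive":
                updated[label] = "active"
                break
    return updated
```

For example, starting from `{"2108a": "active", "2404": "passive"}` and failing `"2108a"`, the sketch promotes `"2404"` so that I/Os to the LUN A may continue over the path to the R2 system.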

The pair of devices or volumes including the R1 device 2124 and the R2 device 2126 may be configured as the same single volume or LUN, such as LUN A. In connection with discussion herein, the LUN A configured and exposed to the host may also be referred to as a stretched volume or device, where the pair of devices or volumes (R1 device 2124, R2 device 2126) is configured to expose the two different devices or volumes on two different data storage systems to a host as the same single volume or LUN. Thus, from the view of the host 2110 a, the same LUN A is exposed over the two paths 2108 a and 2404.

It should be noted that although only a single replication link 2402 is illustrated, more generally any number of replication links may be used in connection with replicating data from the system 2102 to the system 2104.

Referring to FIG. 5, shown is an example configuration of components that may be used in an embodiment. The example 2500 illustrates an active-active configuration as may be used in connection with synchronous replication in at least one embodiment. In the active-active configuration with synchronous replication, the host 2110 a may have a first active path 2108 a to the R1 data storage system and R1 device 2124 configured as LUN A. Additionally, the host 2110 a may have a second active path 2504 to the R2 data storage system and the R2 device 2126 configured as the same LUN A. From the view of the host 2110 a, the paths 2108 a and 2504 appear as 2 paths to the same LUN A as described in connection with FIG. 4 with the difference that the host in the example 2500 configuration may issue I/Os, both reads and/or writes, over both of the paths 2108 a and 2504 at the same time. The host 2110 a may send a first write over the path 2108 a which is received by the R1 system 2102 and written to the cache of the R1 system 2102 where, at a later point in time, the first write is destaged from the cache of the R1 system 2102 to physical storage provisioned for the R1 device 2124 configured as the LUN A. The R1 system 2102 also sends the first write to the R2 system 2104 over the link 2402 where the first write is written to the cache of the R2 system 2104, where, at a later point in time, the first write is destaged from the cache of the R2 system 2104 to physical storage provisioned for the R2 device 2126 configured as the LUN A. Once the first write is written to the cache of the R2 system 2104, the R2 system 2104 sends an acknowledgement over the link 2402 to the R1 system 2102 that it has completed the first write. The R1 system 2102 receives the acknowledgement from the R2 system 2104 and then returns an acknowledgement to the host 2110 a over the path 2108 a, where the acknowledgement indicates to the host that the first write has completed.

The host 2110 a may also send a second write over the path 2504 which is received by the R2 system 2104 and written to the cache of the R2 system 2104 where, at a later point in time, the second write is destaged from the cache of the R2 system 2104 to physical storage provisioned for the R2 device 2126 configured as the LUN A. The R2 system 2104 also sends the second write to the R1 system 2102 over a second link 2502 where the second write is written to the cache of the R1 system 2102, and where, at a later point in time, the second write is destaged from the cache of the R1 system 2102 to physical storage provisioned for the R1 device 2124 configured as the LUN A. Once the second write is written to the cache of the R1 system 2102, the R1 system 2102 sends an acknowledgement over the link 2502 to the R2 system 2104 that it has completed the second write. Once the R2 system 2104 receives the acknowledgement from the R1 system (regarding completion of the second write), the R2 system 2104 then returns an acknowledgement to the host 2110 a over the path 2504 that the second write has completed.

As discussed in connection with FIG. 4, the FIG. 5 also includes the pair of devices or volumes (the R1 device 2124 and the R2 device 2126) configured as the same single stretched volume, the LUN A. From the view of the host 2110 a, the same stretched LUN A is exposed over the two active paths 2504 and 2108 a.

In the example 2500, the illustrated active-active configuration includes the stretched LUN A configured from the device or volume pair (R1 2124, R2 2126), where the device or volume pair (R1 2124, R2 2126) is further configured for synchronous replication from the system 2102 to the system 2104, and also configured for synchronous replication from the system 2104 to the system 2102. In particular, the stretched LUN A is configured for dual, bi-directional or two way synchronous remote replication: synchronous remote replication of writes from R1 2124 to R2 2126, and synchronous remote replication of writes from R2 2126 to R1 2124. To further illustrate synchronous remote replication from the system 2102 to the system 2104 for the stretched LUN A, a write to the stretched LUN A sent over 2108 a to the system 2102 is stored on the R1 device 2124 and also transmitted to the system 2104 over 2402. The write sent over 2402 to system 2104 is stored on the R2 device 2126. Such replication is performed synchronously in that the received host write sent over 2108 a to the data storage system 2102 is not acknowledged as successfully completed to the host 2110 a unless and until the write data has been stored in caches of both the systems 2102 and 2104.

In a similar manner, the illustrated active-active configuration of the example 2500 provides for synchronous replication from the system 2104 to the system 2102, where writes to the LUN A sent over the path 2504 to system 2104 are stored on the device 2126 and also transmitted to the system 2102 over the connection 2502. The write sent over 2502 is stored on the R1 device 2124. Such replication is performed synchronously in that the host write sent over 2504 is not acknowledged as successfully completed unless and until the write data has been stored in the caches of both the systems 2102 and 2104.

It should be noted that although FIG. 5 illustrates for simplicity a single host accessing both the R1 device 2124 and R2 device 2126, any number of hosts may access one or both of the R1 device 2124 and the R2 device 2126.

Although only a single link 2402 is illustrated in connection with replicating data from the system 2102 to the system 2104, more generally any number of links may be used. Although only a single link 2502 is illustrated in connection with replicating data from the system 2104 to the system 2102, more generally any number of links may be used. Furthermore, although 2 links 2402 and 2502 are illustrated, in at least one embodiment, a single link may be used in connection with sending data from system 2102 to 2104, and also from 2104 to 2102.

FIG. 5 illustrates an active-active remote replication configuration for the stretched LUN A. The stretched LUN A is exposed to the host by having each volume or device of the device pair (R1 device 2124, R2 device 2126) configured and presented to the host as the same volume or LUN A. Additionally, the stretched LUN A is configured for two way synchronous remote replication between the two devices or volumes of the device pair.

In an embodiment described herein, the data storage system may be a SCSI-based system such as a SCSI-based data storage array. An embodiment in accordance with the techniques herein may include hosts and data storage systems which operate in accordance with the standard SCSI Asymmetric Logical Unit Access (ALUA). The ALUA standard specifies a mechanism for asymmetric or symmetric access of a logical unit or LUN as used herein. ALUA allows the data storage system to set a LUN's access state with respect to a particular initiator port and the target port. Thus, in accordance with the ALUA standard, various access states (also sometimes referred to herein as ALUA states or path states) may be associated with a path with respect to a particular device, such as a LUN. In particular, the ALUA standard defines such access states including the active-optimized, active-non-optimized, and unavailable states as described herein. The ALUA standard also defines other access states, such as standby and in-transition or transitioning (i.e., denoting that a particular path is in the process of transitioning between states for a particular LUN). A recognized path (such as recognized by a host as a result of discovery processing) over which I/Os (e.g., read and write I/Os) may be issued to access data of a LUN may have an “active” state, such as active-optimized or active-non-optimized. Active-optimized is an active path to a LUN that is preferred over any other path for the LUN having an “active-non-optimized” state. A path for a particular LUN having the active-optimized path state may also be referred to herein as an optimized or preferred path for the particular LUN. Thus active-optimized denotes a preferred path state for the particular LUN. A path for a particular LUN having the active-non-optimized (or unoptimized) path state may also be referred to herein as a non-optimized or non-preferred path for the particular LUN. Thus active-non-optimized denotes a non-preferred path state with respect to the particular LUN. Generally, I/Os directed to a LUN that are sent by the host to the data storage system over active-optimized and active-non-optimized paths are processed by the data storage system. However, the host may select to send I/Os to a LUN from those paths having an active-optimized state for the LUN. The host may proceed to use a path having an active-non-optimized state for the LUN only if there is no active-optimized path for the LUN. A recognized path over which I/Os may not be issued to access data of a LUN may have an “unavailable” state. When a path to a LUN is in the unavailable state, a limited set of non-I/O-based commands (e.g., other than read and write commands to, respectively, read and write user data), such as the SCSI INQUIRY, may be issued. It should be noted that such limited set of non-I/O-based commands may also be issued over an active (e.g., active-optimized and active-non-optimized) path as well.
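As a non-limiting illustration only, the host-side use of the path states described above may be sketched as follows. The names `PathState`, `paths_for_io`, and `command_allowed`, and the contents of the command sets, are hypothetical choices for this sketch; only three of the access states are modeled, and the standby and transitioning states are omitted.

```python
from enum import Enum


class PathState(Enum):
    ACTIVE_OPTIMIZED = "active-optimized"
    ACTIVE_NON_OPTIMIZED = "active-non-optimized"
    UNAVAILABLE = "unavailable"


IO_COMMANDS = {"READ", "WRITE"}
NON_IO_COMMANDS = {"INQUIRY"}  # assumed example of the limited non-I/O set


def paths_for_io(path_states):
    """Return the paths a host would use for I/O to one LUN.

    Active-optimized paths are preferred; active-non-optimized paths are
    used only if no active-optimized path exists.
    """
    optimized = [p for p, s in path_states.items()
                 if s is PathState.ACTIVE_OPTIMIZED]
    if optimized:
        return optimized
    return [p for p, s in path_states.items()
            if s is PathState.ACTIVE_NON_OPTIMIZED]


def command_allowed(state, command):
    """Report whether a command may be issued over a path in this state."""
    if command in NON_IO_COMMANDS:
        return True  # permitted on active and unavailable paths alike
    if command in IO_COMMANDS:
        return state is not PathState.UNAVAILABLE
    return False
```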

Referring to FIG. 6, shown is an example of an embodiment of a system that may be utilized in connection with the techniques herein. The example 300 includes a host 302, a network 340 and a data storage system 320. The host 302 and the data storage system 320 may communicate over one or more paths 340 a-d through the network 340. The paths 340 a-d are described in more detail below. The LUNs A and B are included in the set 330, and the LUNs C and D are included in the set 332. The LUNs of the sets 330 and 332 are configured from non-volatile BE storage PDs of the data storage system 320. The data storage system includes two nodes: node A 322 and node B 324. The nodes 322, 324 may be as described elsewhere herein. The element 301 denotes an internode communication connection similar, for example, to the connection 120 of FIG. 2. Consistent with other discussion herein such as in connection with FIG. 2, the BE PDs from which storage is provisioned for the LUNs of 330, 332 are accessible to both the nodes 322, 324.

The host 302 may include an application 304, a multi-path (MP) driver 306 and other components 308. The other components 308 may include, for example, one or more other device drivers, an operating system, and other code and components of the host. An I/O operation from the application 304 may be communicated to the data storage system 320 using the MP driver 306 and one or more other components of the data path or I/O path. The application 304 may be a database or other application which issues data operations, such as I/O operations, to the data storage system 320. Each of the I/O operations may be directed to a LUN, such as one of the LUNs of 330, 332, configured to be accessible to the host 302 over multiple physical paths. As such, each of the I/O operations may be forwarded from the application 304 to the data storage system 320 over one of the possible multiple paths.

The MP driver 306 may include functionality to perform any one or more different types of processing such as related to multipathing. For example, the MP driver 306 may include multipathing functionality for management and use of multiple paths. For example, the MP driver 306 may perform path selection to select one of the possible multiple paths based on one or more criteria such as load balancing to distribute I/O requests for the target device across available active-optimized or preferred paths. Host side load balancing may be performed by the MP driver to provide for better resource utilization and increased performance of the host, data storage system, and network or other connection infrastructure. The host 302 may also include other components 308 such as one or more other layers of software used in connection with communicating the I/O operation from the host to the data storage system 320. For example, element 308 may include Fibre Channel (FC), SCSI and NVMe (Non-Volatile Memory Express) drivers, a logical volume manager (LVM), and the like. It should be noted that element 308 may include software or other components used when sending an I/O operation from the application 304 where such components include those invoked in the call stack of the data path above the MP driver 306 and also below the MP driver 306. For example, application 304 may issue an I/O operation which is communicated in the call stack including an LVM, the MP driver 306, and a SCSI driver.

The data storage system 320 may include one or more BE PDs configured to store data of one or more LUNs. Each of the LUNs 330, 332 may be configured to be accessible to the host 302 through multiple paths. The node A 322 in this example has two data storage system target ports T1 and T2. The node B 324 in this example has two data storage system target ports T3 and T4. The host 302 includes 4 host initiator ports I1, I2, I3 and I4. The path 340 a is formed using the endpoints I1 and T1 and may be denoted as I1-T1. The path 340 b is formed using the endpoints I2 and T2 and may be denoted as I2-T2. The path 340 c is formed using the endpoints I3 and T3 and may be denoted as I3-T3. The path 340 d is formed using the endpoints I4 and T4 and may be denoted as I4-T4.

In at least one embodiment in accordance with the SCSI standard, each of the initiators and target ports in FIG. 6 as well as other figures herein may have unique WWNs.

In this example, all of the LUNs A, B, C and D may be accessible or exposed over all the data storage system target ports T1, T2, T3 and T4 over the paths 340 a-d. As described in more detail below, a first set of paths to the node A 322 may be specified as active-optimized or preferred for the LUNs of the set 330 and a second set of paths to the node B 324 may be specified as active-optimized or preferred for the LUNs of the set 332. Additionally, the first set of paths to the node A 322 may be specified as active-non-optimized or non-preferred for the LUNs of the set 332 and the second set of paths to the node B 324 may be specified as active-non-optimized or non-preferred for the LUNs of the set 330.

The multiple active paths allow the application I/Os to the LUNs A, B, C and D to be routed over the multiple paths 340 a-d and, more generally, allow the LUNs A, B, C and D to be accessed over the multiple paths 340 a-d. In the event that there is a component failure in one of the active-optimized multiple paths for a particular LUN, application I/Os directed to the particular LUN can be easily routed over other alternate preferred paths unaffected by the component failure. Additionally, in the event there are no preferred paths available for issuing I/Os to the particular LUN, non-preferred paths for the particular LUN may be used to send the I/Os to the particular LUN. Thus, an embodiment of the MP driver 306 may also perform other processing in addition to load balancing in connection with path selection. The MP driver 306 may be aware of, and may monitor, all paths between the host and the LUNs A, B, C and D in order to determine the particular state of such paths with respect to the various LUNs. In this manner, the MP driver may determine which of the multiple paths over which a LUN is visible may be used for issuing I/O operations successfully. Additionally, the MP driver may use such information to select a path for host-data storage system communications issued to the particular LUN.
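The path selection just described may be illustrated with a minimal sketch. The state strings, the function names select_paths and round_robin, and the use of round-robin as the load balancing criterion are assumptions made for illustration only; an actual MP driver may use different structures and criteria.

```python
from itertools import cycle

# Hypothetical state labels used only in this sketch.
ACTIVE_OPT = "active-optimized"
ACTIVE_NON_OPT = "active-non-optimized"
UNAVAILABLE = "unavailable"


def select_paths(path_states):
    """Return the paths usable for I/O to one LUN.

    Preferred (active-optimized) paths are returned when any exist;
    otherwise the non-preferred (active-non-optimized) paths are returned.
    Unavailable paths are never returned.
    """
    preferred = [p for p, s in path_states.items() if s == ACTIVE_OPT]
    if preferred:
        return preferred
    return [p for p, s in path_states.items() if s == ACTIVE_NON_OPT]


def round_robin(paths):
    """Assumed load balancing policy: rotate over the candidate paths."""
    return cycle(paths)


# Path states for the LUN A as in the example 300.
lun_a = {
    "I1-T1": ACTIVE_OPT,
    "I2-T2": ACTIVE_OPT,
    "I3-T3": ACTIVE_NON_OPT,
    "I4-T4": ACTIVE_NON_OPT,
}
```

With the states above, select_paths(lun_a) yields the two paths to the node A; if both of those paths were marked unavailable, the same call would fall back to the paths I3-T3 and I4-T4.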

In the example 300, each of the LUNs A, B, C and D may be exposed through the 4 paths 340 a-d. As described in more detail below, each of the paths 340 a-d may have an associated ALUA state also used by the host when issuing I/O operations. Each path 340 a-d may be represented by two path endpoints: a first endpoint on the host 302 and a second endpoint on the data storage system 320. The first endpoint may correspond to a port of a host component, such as a host bus adapter (HBA) of the host 302, and the second endpoint may correspond to a target port of a data storage system component, such as a target port of a node of the data storage system 320. In the example 300, the elements I1, I2, I3 and I4 each denote a port of the host 302 (e.g., a port of an HBA), and the elements T1, T2, T3 and T4 each denote a target port of a node of the data storage system 320.

The MP driver 306, as well as other components of the host 302, may execute in kernel mode or other privileged execution mode. In one embodiment using a Unix-based operating system, the MP driver 306 may execute in kernel mode. In contrast, the application 304 may typically execute in user mode, or more generally, a non-privileged execution mode. Furthermore, it will be appreciated by those skilled in the art that the techniques herein may be used in an embodiment having any one of a variety of different suitable operating systems including a Unix-based operating system as mentioned above, any one of the Microsoft Windows® operating systems, a virtualized environment, such as using the VMware™ ESX hypervisor by VMware, Inc., and the like.

In operation, the application 304 may issue one or more I/O operations (e.g., read and write commands or operations) directed to the LUNs 330, 332 of the data storage system. Such I/O operations from the application 304 may be directed to the MP driver 306 after passing through any intervening layers of the data or I/O path.

In connection with the SCSI standard, a path may be defined between two ports as described above. A command may be sent from the host (as well as a component thereof such as an HBA) and may be characterized as an initiator, originator or source with respect to the foregoing path. The host, as the initiator, sends requests to a data storage system (as well as a particular component thereof such as a node having a port with a network address) characterized as a target, destination, receiver, or responder. Each physical connection of a path may be between a first endpoint which is an initiator port (e.g., I1) of the host and a second endpoint (e.g., T1) which is a target port of a node in the data storage system. Over each such path, one or more LUNs may be visible or exposed to the host initiator through the target port of the data storage system.

In connection with some protocols such as the SCSI protocol, each path as related to sending and receiving of I/O commands may include 2 endpoints. As discussed herein, the host, or port thereof, may be an initiator with respect to I/Os issued from the host to a target port of the data storage system. In this case, the host and data storage system ports are examples of such endpoints. In the SCSI protocol, communication may be unidirectional in that one of the endpoints, such as the host HBA port, is the initiator and the other endpoint, such as the data storage system target port, is the target receiving the commands from the initiator.

An I/O command or operation, such as a read or write operation, from the host to the data storage system may be directed to a LUN and a logical address or location in the LUN's logical address space. The logical address or location of the LUN may be characterized as the target logical address of the I/O operation. The target logical address or location of the I/O operation may identify an LBA within the defined logical address space of the LUN. The I/O command may include various information such as information identifying the particular type of I/O command as read or write, information identifying the target logical address (e.g., LUN and LUN logical address) of the I/O command, and other information. In connection with servicing the I/O operation, the data storage system may map the target logical address to a physical storage location on a PD of the data storage system. The physical storage location may denote the physical storage allocated or provisioned and also mapped to the target logical address.

In an embodiment described herein, the data storage system 320 may be a SCSI-based system such as a SCSI-based data storage array operating in accordance with the ALUA standard. As described herein, a data storage system in accordance with techniques herein may set an access path state for a particular LUN over a particular path from an initiator to a target of the data storage system. For example, the data storage system may set an access path state for a particular LUN on a particular path to active-optimized (also referred to herein as simply “optimized” or “preferred”) to denote the path as a preferred path for sending I/Os directed to the LUN. The data storage system may set an access path state for a particular LUN on a particular path to active-non-optimized (also referred to herein as simply “non-optimized” or “non-preferred”) to denote a non-preferred path for sending I/Os directed to the LUN. The data storage system may also set the access path state for a particular LUN on a particular path to other suitable access states. Although discussion herein may refer to the data storage system setting and modifying the path access states of the paths between the host and the data storage system, in some embodiments, a host may also set and/or modify the path access states which are then communicated to the data storage system.

In accordance with the techniques herein, the data storage system may set the path state for a particular LUN to preferred or non-preferred for any suitable purpose. In at least one embodiment, multipathing software, such as the MP driver, on the host may monitor the particular access path state as may be set by the data storage system with respect to a particular LUN to determine which path to select for sending I/Os to the LUN. Thus, when the LUN is exposed to a host initiator over multiple paths (e.g., where the same LUN is accessible through multiple different target ports of the data storage system), the data storage system may vary the associated access state of each such path in order to vary and control the particular ones of the multiple paths over which the host may issue I/Os to the LUN.

The element 330 indicates that the LUN A and the LUN B are exposed to the host 302 over preferred paths to the node A 322 and non-preferred paths to the node B 324. The element 332 indicates that the LUN C and the LUN D are exposed to the host 302 over preferred paths to the node B 324 and non-preferred paths to the node A 322. Thus, the paths 340 c-d to the target ports T3 and T4 of node B 324 are set to optimized or preferred for the LUNs C and D and set to non-optimized or non-preferred for the remaining LUNs A and B; and the paths 340 a-b to the target ports T1 and T2 of node A 322 are set to preferred or optimized for the LUNs A and B and set to non-optimized or non-preferred for the remaining LUNs C and D.

In at least one embodiment, target ports are given identifiers and may be organized into target port groups (TPGs). In at least one embodiment, a TPG may be defined as a logical grouping or collection of one or more target port identifiers that share the same access characteristics for a particular LUN. For example, target ports T1 and T2 may be included in a first TPG and target ports T3 and T4 may be included in a second TPG. With ALUA in at least one embodiment, a LUN may be visible with respect to the entire TPG rather than on a port level basis. In other words, a LUN may be exposed or visible on a TPG level. If the LUN is visible or accessible on a first target port in the first TPG including that first target port, then the LUN is also accessible or visible on all target ports of the first TPG. Each TPG can take on a state (e.g., preferred or non-preferred). For a given LUN, the LUN is visible on the TPG level basis (e.g., with respect to all target ports of a TPG). Thus the LUN has the same path state or access characteristic with respect to all target ports of the same TPG. For example, the first TPG noted above may include all target ports of one of the nodes such as node A 322 over which the LUNs A, B, C and D are exposed; and the second TPG noted above may include all target ports of one of the nodes such as node B 324 over which the LUNs A, B, C and D are exposed.
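As a minimal sketch of the TPG-level behavior described above, the following keeps one access state per LUN on the group rather than on each port, so that every target port of a TPG necessarily reports the same state for a given LUN. The class name TargetPortGroup, its fields, and the helper port_state are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TargetPortGroup:
    """Hypothetical TPG record: a set of ports sharing per-LUN states."""
    name: str
    ports: list
    lun_states: dict = field(default_factory=dict)  # LUN name -> state

    def set_state(self, lun, state):
        # The state is stored once for the group, not once per port.
        self.lun_states[lun] = state


def port_state(tpgs, port, lun):
    """Look up the state of a LUN on a port through the port's TPG."""
    for tpg in tpgs:
        if port in tpg.ports:
            return tpg.lun_states.get(lun)
    return None


# First and second TPGs as in the example: node A ports and node B ports.
tpg1 = TargetPortGroup("TPG1", ["T1", "T2"])
tpg2 = TargetPortGroup("TPG2", ["T3", "T4"])
tpg1.set_state("LUN A", "preferred")
tpg2.set_state("LUN A", "non-preferred")
```

In this sketch, port_state([tpg1, tpg2], "T1", "LUN A") and port_state([tpg1, tpg2], "T2", "LUN A") return the same value because both ports resolve to the same group entry.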

The table 310 denotes the different path states for each of the 4 paths for the 4 LUNs A, B, C and D. The table 310 reflects the path states as discussed above. The row 312 indicates that path I1-T1 including the target port T1 of node A 322 is active optimized (opt) or preferred for the LUNs A and B and active non-optimized (non-opt) or non-preferred for the LUNs C and D. The row 314 indicates that path I2-T2 including the target port T2 of node A 322 is optimized (opt) or preferred for the LUNs A and B and non-optimized (non-opt) or non-preferred for the LUNs C and D. The row 316 indicates that path I3-T3 including the target port T3 of node B 324 is optimized (opt) or preferred for the LUNs C and D and non-optimized (non-opt) or non-preferred for the LUNs A and B. The row 318 indicates that path I4-T4 including the target port T4 of node B 324 is optimized (opt) or preferred for the LUNs C and D and non-optimized (non-opt) or non-preferred for the LUNs A and B.

Assume further, for example, the node B 324 of the data storage system 320 now experiences a failure so that the target ports T3 and T4 and thus the paths 340 c, 340 d are unavailable. In response to the failure of the node B 324 and the target ports T3 and T4, the path states may be updated from the states of the table 310 to the revised path states of the table 320. In the table 320, due to the failure and unavailability of the paths 340 c-d, 1) the path states of 322 indicate that the path 340 a I1-T1 and the path 340 b I2-T2 have transitioned from the non-optimized to the optimized or preferred path state for the LUNs C and D; and 2) the path states of 324 indicate that the path I3-T3 340 c and the path 340 d I4-T4 for the LUNs A, B, C and D have transitioned to the unavailable state.
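The transition from the path states of the table 310 to the revised path states after the node B failure may be sketched as follows. This is an illustrative model only: the dictionaries PATH_NODE and PREFERRED_NODE, the function build_table, and the rule that the surviving node's paths become optimized for every LUN are assumptions chosen to reproduce the states described above for a dual node system.

```python
OPT, NON_OPT, UNAVAIL = "opt", "non-opt", "unavailable"

# Which node each path terminates on, and which node is preferred per LUN.
PATH_NODE = {"I1-T1": "A", "I2-T2": "A", "I3-T3": "B", "I4-T4": "B"}
PREFERRED_NODE = {"LUN A": "A", "LUN B": "A", "LUN C": "B", "LUN D": "B"}


def build_table(failed_nodes=()):
    """Return {path: {lun: state}} given the set of failed nodes."""
    table = {}
    for path, node in PATH_NODE.items():
        row = {}
        for lun, preferred in PREFERRED_NODE.items():
            if node in failed_nodes:
                # Paths to a failed node cannot be used for any LUN.
                row[lun] = UNAVAIL
            elif node == preferred or preferred in failed_nodes:
                # Either the normal preferred path, or the surviving
                # node taking over for the failed preferred node.
                row[lun] = OPT
            else:
                row[lun] = NON_OPT
        table[path] = row
    return table


before = build_table()              # states with both nodes available
after = build_table({"B"})          # revised states after node B fails
```

Here, before corresponds to the states of the table 310, and after shows the paths I1-T1 and I2-T2 as optimized for all four LUNs with the paths I3-T3 and I4-T4 unavailable.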

It is noted that other embodiments may have path state changes different from those denoted by the table 320.

A metro cluster configuration may be used herein to refer to a configuration including two data storage systems respectively configured with two devices or volumes with the same identity that cooperate to expose a stretched volume or LUN, such as in the FIGS. 4 and 5, to one or more hosts. In the metro cluster configuration, the hosts and applications running on the hosts perceive the two devices or volumes configured to have the same identity as the same single stretched volume, device or LUN.

In a metro cluster configuration, each of the two data storage systems may be in different data centers or may be in two server rooms or different physical locations within the same data center. The metro cluster configuration may be used in a variety of different use cases such as, for example, increased availability and disaster avoidance and DR, resource balancing across data centers and data storage systems, and storage migration.

In a metro cluster configuration, hosts may be configured with uniform host connectivity as illustrated in FIGS. 4 and 5, where a host may be connected to both data storage systems exposing the pair of devices or volumes configured as the same stretched volume or LUN, such as the LUN A described in connection with FIG. 5. From the perspective of the host 2110 a of FIG. 5, the data storage system 2102 may be a local data storage system included in the same data center as the host 2110 a, and the data storage system 2104 may be a remote data storage system. Thus the host 2110 a is configured with uniform host connectivity. In contrast to uniform host connectivity is non-uniform host connectivity, where the host is only connected to the local data storage system but not the remote data storage system of the metro cluster configuration.

Referring to FIG. 7A, shown is a more detailed illustration of a metro cluster configuration. The example 400 includes a stretched volume or LUN A and two hosts 412, 432 configured with uniform host connectivity in at least one embodiment.

In the FIG. 7A, the host 1 412 and the data storage system 1 410 are in the data center 1 420 a. The host 2 432 and the data storage system 2 430 are in the data center 2 420 b. The host 1 412 includes the initiators I11-I14. The host 2 432 includes the initiators I31-I34. The data storage systems 410, 430 may be dual node data storage systems such as described elsewhere herein (e.g., FIG. 2). The data storage system 410 includes the node A 410 a with the target ports T11-T12, and the node B 410 b with the target ports T13-T14. The data storage system 430 includes the node A 430 a with the target ports T31-T32, and the node B 430 b with the target ports T33-T34. From the perspective of host 1 412, the data storage system 1 410 and data center 1 420 a may be characterized as local, and the data storage system 2 430 and the data center 2 420 b may be characterized as remote. From the perspective of host 2 432, the data storage system 1 410 and data center 1 420 a may be characterized as remote, and the data storage system 2 430 and the data center 2 420 b may be characterized as local.

As illustrated in the FIG. 7A, the stretched volume or LUN A is configured from the device or volume pair LUN A 425 a and LUN A″ 425 b, where both the LUNs or volumes 425 a-b are configured to have the same identity from the perspective of the hosts 412, 432. The LUN A 425 a and the LUN A″ 425 b are configured for two way synchronous remote replication 402 which, consistent with other description herein, provides for automated synchronous replication of writes of the LUN A 425 a to the LUN A″ 425 b, and also automated synchronous replication of writes of the LUN A″ 425 b to the LUN A 425 a. The LUN A 425 a may be exposed to the hosts 412, 432 over the target ports T11-T14 of the system 410, and the LUN A″ 425 b may be exposed to the hosts 412, 432 over the target ports T31-T34.

In at least one embodiment in which the arrangement of FIG. 7A is in accordance with the ALUA protocol, the paths 423 a-f may be configured with the path state of active non-optimized and the paths 422 a-b may be configured with the path state of active optimized. Thus, the host 412 has uniform host connectivity to the stretched volume or LUN A by the active connections or paths 422 a (I11-T11), 423 a (I12-T13) to the data storage system 410 exposing the LUN A 425 a, and the active connections or paths 423 b (I13-T31), 423 c (I14-T33) to the data storage system 430 exposing the LUN A″ 425 b. The host 432 has uniform host connectivity to the stretched volume or LUN A by the active connections or paths 423 d (I31-T12), 423 e (I32-T14) to the data storage system 410 exposing the LUN A 425 a, and the active connections or paths 422 b (I33-T32), 423 f (I34-T34) to the data storage system 430 exposing the LUN A″ 425 b.

Uniform host connectivity deployments such as illustrated in FIG. 7A offer high resiliency to failure of any local component or cross data center connection. Failures such as a total loss of a local storage system (that is local from a host's perspective) result in the host performing I/Os using the cross-datacenter links to the remote data storage system, which results in increased latency but does not require immediate application restart since I/Os issued from the host are still serviced using the remote data storage system. FIG. 7A illustrates a configuration that may also be referred to as a metro cluster configuration with a pair of data storage systems 410, 430. With respect to a host, such as the host 412, one of the data storage systems, such as the system 410, may be local and in the same data center as the host, and the other remaining data storage system, such as the system 430, may be remote and in a different location or data center than the host 412.

With reference to FIG. 7A, the element 411 denotes the data storage system management software application A for the system 410, and the element 413 denotes the data storage system management application B for the system 430. The management applications 411 and 413 may communicate with one another through a network or other suitable communication connection when performing the processing needed for the techniques described herein. The element 411 a represents the management database (DB) A that stores management and other information used by the management application A 411 for the system 410. The element 413 a represents the management DB B that stores management and other information used by the management application B 413 for the system 430.

To further illustrate, the FIG. 7A may denote the path states at a first point in time T1. At a second point in time T2 subsequent to T1 and illustrated in the FIG. 7B, the data storage system 2 430 may experience a failure or disaster where the LUN A″ 425 b on the system 430 is unavailable and cannot be accessed through the target ports T31-T34. In response to the unavailability of the data storage system 430, the host 2 432 uses the path 454 b to issue I/Os to the LUN A 425 a on the data storage system 410. Thus, failure of the system 430 that is local to the host 432 results in the host 432 performing I/Os using the cross-data center link 454 b to the remote system 410 which results in increased latency but does not require immediate application restart since I/Os issued by the application 3 (app 3) on the host 432 may still be serviced using the remote system 410.

In response to the unavailability of the data storage system 430, the paths 452 a-d to the system 430 transition to the unavailable path state, the path 454 a remains active optimized, the path 454 b transitions from active non-optimized to active optimized, and the remaining paths 456 a-b remain active non-optimized.

FIG. 7A illustrates connectivity between the hosts 412, 432 and the data storage systems 410, 430 under normal operating conditions where both systems 410, 430 and both volumes or LUNs 425 a, 425 b are available to the hosts 412, 432 for servicing I/Os. In such normal operating conditions, the ALUA path states may be as described in connection with FIG. 7A where each of the hosts 412, 432 issues I/Os to the particular one of the systems 410, 430 that is local or in the same data center as the particular host. In such normal operating conditions as illustrated in FIG. 7A, at least one “local path” between the host and the local data storage system is active optimized, and remote paths between the host and the remote data storage system are active non-optimized. One or more of the remote paths with respect to a particular host may be used in the event the local data storage system and/or local paths to the local data storage system are unavailable such as described in connection with FIG. 7B with respect to the host 432.

Thus, in the absence of a data storage system failure and under normal operating conditions such as illustrated in FIG. 7A, the host 412 issues I/Os to its local data storage system 410 where the host 412 and the system 410 are located in the same data center 420 a; and the host 432 issues I/Os to its local data storage system 430 where the host 432 and the system 430 are located in the same data center 420 b.

Generally, there are several ways to accomplish having each host under normal conditions issue I/Os to a local data storage system in the same data center as the host.

In some implementations, a native host multi-path driver or a third party multi-path driver may be able to differentiate the particular paths to the local data storage system and the particular paths to the remote data storage system based on path latency. Generally the paths experiencing the largest latencies when sending an I/O may be determined as those to the remote data storage system, and those with the smallest latencies may be determined as those to the local data storage system. In such implementations, the host utilizes its multi-path driver to select a particular path to a local data storage system over which to send I/Os.
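One way a multi-path driver might separate local paths from remote paths by latency is sketched below. The heuristic used here, taking the median latency per path and splitting at the midpoint between the smallest and largest medians, is an assumption made purely for illustration; the function name classify_paths and the sample latency values are likewise hypothetical, and actual drivers may use other measures.

```python
from statistics import median


def classify_paths(latency_ms):
    """Split paths into (local, remote) lists by observed I/O latency.

    latency_ms maps a path name to a list of latency samples in
    milliseconds. Paths whose median latency is at or below the midpoint
    between the smallest and largest medians are treated as local.
    """
    medians = {path: median(samples) for path, samples in latency_ms.items()}
    cutoff = (min(medians.values()) + max(medians.values())) / 2
    local = sorted(p for p, m in medians.items() if m <= cutoff)
    remote = sorted(p for p, m in medians.items() if m > cutoff)
    return local, remote


# Illustrative samples for the four paths of the host 412 in FIG. 7A.
samples = {
    "I11-T11": [0.4, 0.5, 0.6],
    "I12-T13": [0.5, 0.5, 0.7],
    "I13-T31": [2.1, 2.4, 2.2],
    "I14-T33": [2.3, 2.0, 2.6],
}
local_paths, remote_paths = classify_paths(samples)
```

A midpoint split is only meaningful when the two groups of latencies are well separated, as with the cross-data center links in this example.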

In at least one embodiment, processing may be performed consistent with discussion elsewhere herein where the data storage systems determine the ALUA path states, such as in connection with FIGS. 6, 7A and 7B, and expose or communicate such ALUA path states (also sometimes referred to herein as access states) to the hosts. Thus, when the LUN is exposed to a host initiator over multiple paths (e.g., where the same LUN is accessible through multiple different target ports of the data storage system), the data storage systems may vary the associated access state of each such path in order to vary and control the particular ones of the multiple paths over which the host may issue I/Os to the LUN. In particular, processing may be performed by the data storage systems, such as the systems 410, 430 of FIGS. 7A and 7B, to determine which particular paths to the hosts 412, 432 are active optimized and which are active non-optimized at various points in time. The processing may include the data storage systems 410, 430 communicating the path states to the hosts 412, 432 and then also notifying the hosts 412, 432 when there are any changes to the path states, such as in response to a data storage system failure such as illustrated in FIG. 7B. In this manner, the hosts 412, 432 may select paths over which to send I/Os based on the particular ALUA path states or access states for particular volumes or LUNs as determined and communicated by the data storage systems 410, 430, where I/Os are sent by the hosts over those active-optimized paths.

Consistent with discussion herein such as in connection with FIGS. 5, 7A and 7B, a stretched volume or LUN is configured from a LUN or volume pair (R1, R2), where R1 and R2 are different instances of LUNs or volumes respectively on two data storage systems of the metro cluster. Further, the volumes R1 and R2 are configured to have the same identity and appear to a host as the same volume or LUN. Thus a volume or LUN on a first local data storage system may be characterized as stretched if that volume or LUN also has a matching counterpart remote volume or LUN on the other remote data storage system of the metro cluster pair.

In contrast to the stretched volume or LUN is an unstretched or non-stretched volume or LUN. A volume or LUN may be characterized as an unstretched volume or LUN existing on only one data storage system within the metro cluster pair.

An operation referred to herein as stretching a LUN or volume may be applied to an unstretched LUN or volume whereby a local unstretched volume or LUN on only one of the data storage systems of the metro cluster pair is converted to a stretched LUN or volume. Converting the unstretched volume or LUN of a first local data storage system of the metro cluster pair to a stretched volume may include creating a counterpart remote LUN on the second remote data storage system of the metro configuration. Consistent with other discussion herein regarding a stretched volume or LUN, from the external host perspective, the counterpart remote LUN is configured to have the same identity as the non-stretched LUN on the first data storage system. In connection with stretching an existing local unstretched LUN having the normal attribute, the local LUN has its attribute modified to stretched to denote a stretched volume.
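The stretch operation described above may be sketched in simplified form as follows. The Volume record, the stretch function, and the choice to mark the newly created counterpart as stretched as well are assumptions introduced for this illustration; replication setup and ALUA path state changes are omitted.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    """Hypothetical volume record used only in this sketch."""
    identity: str            # identity presented to external hosts
    system: str              # data storage system holding this instance
    attribute: str = "normal"


def stretch(local, remote_system):
    """Convert an unstretched local volume to a stretched volume.

    Creates the counterpart remote volume with the same external
    identity and changes the local attribute from normal to stretched.
    """
    if local.attribute == "stretched":
        raise ValueError("volume is already stretched")
    # Assumption: the counterpart is also recorded as stretched.
    remote = Volume(identity=local.identity, system=remote_system,
                    attribute="stretched")
    local.attribute = "stretched"
    return remote


r1 = Volume(identity="LUN A", system="410")
r2 = stretch(r1, remote_system="430")
```

After the call, r1 and r2 carry the same identity "LUN A" on different systems, which is what allows a host to treat paths to either instance as paths to one volume.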

In connection with stretching a LUN or creating a stretched LUN, such as creating the stretched LUN A or stretching the LUN A 425 a resulting in the stretched LUN or volume configuration with the volumes 425 a and 425 b as illustrated in the FIG. 7A, ALUA path state changes may be made so that the host 1 412 local to the storage system 410 has one or more active optimized paths to the local stretched LUN copy 425 a on the system 410 and one or more active non-optimized paths to the remote stretched LUN copy 425 b on the system 430. Additionally, ALUA path state changes may be made so that the host 2 432 local to the storage system 430 has one or more active optimized paths to the local stretched LUN copy 425 b on the system 430 and one or more active non-optimized paths to the remote stretched LUN copy 425 a on the system 410. In some contexts as discussed herein, a LUN or volume and data storage system may be characterized as local with respect to a host if the host, LUN and data storage system are located in the same data center. Also in some contexts as discussed herein, a volume or LUN may be characterized as having local target ports and local TPGs over which the LUN is exposed to a host. In this case, such local ports and local TPGs may be characterized as local with respect to the LUN in that the LUN, local ports and local TPGs are all included in the same data storage system.

An unstretched volume or LUN of a data storage system included in a data center may be exposed to a host that is local to the data storage system whereby the host and the data storage system are included in the same data center. In this case in an embodiment in accordance with the ALUA standard, the unstretched volume is exposed to the host over at least one path from the data storage system to the host where the at least one path is active optimized. It should be noted that in some instances, under failure conditions, all active optimized paths may be off-line or unavailable whereby only active non-optimized paths remain as available. In this case, the active non-optimized path(s) may be used by the host.

Consistent with other discussion herein, depending on the data storage system implementation, only a single ALUA path within a local data center with respect to a host for a stretched volume may be active optimized such as illustrated in FIG. 7A. In contrast to the foregoing, alternatively, more than a single ALUA path within a local data center for a particular host may be active optimized for the stretched volume. However, in such embodiments consistent with other discussion herein, paths from a host to a remote data storage system and a remote data center for a remote copy of the stretched volume may be active non-optimized in order to make the host prefer to use local paths to the local copy of the stretched volume. It should be noted that while particular figures such as FIG. 7A may show just a single active optimized path for simplicity, in most real-life deployments, paths between the host and a data storage system may have an associated access path state at the group level, such as based on a group of target ports as discussed elsewhere herein.

In connection with the data storage systems, or more particularly, the control path and management software of the data storage systems setting and modifying ALUA path states for exposed volumes or LUNs, the control path and management software of such systems may be configured with, and be aware of, the current topology of the metro cluster configuration. For example, the management software such as denoted by the elements 411 and 413 of FIGS. 7A and 7B knows which hosts and data storage systems are local and included in the same data center, and which hosts and data storage systems are remote and included in different data centers. In this manner, the management software components 411, 413 respectively of the systems 410, 430 may communicate and cooperate to appropriately set ALUA path states and also ensure that both of the systems 410, 430 report the same information to the hosts 412, 432 for the exposed volumes or LUNs, such as the stretched LUN A configured from the volume pair 425 a, 425 b.

A stretched volume may be stretched between and among two data storage systems included in a metro cluster configuration as described elsewhere herein, for example, such as in FIGS. 5 and 7A. More generally, a volume or LUN may be stretched between and among more than two data storage systems included in a metro cluster configuration. For example, with reference to FIG. 7C, the stretched volume A is configured from a first volume R1 LUN A 425 a on the system 410 and a second volume R2 LUN A″ 425 b on the system 430, where the volumes 425 a and 425 b are configured to have the same identity, “LUN A”, as presented to one or more hosts (not shown for simplicity of illustration). As discussed above such as in connection with FIG. 7A, the volumes 425 a-b may be configured for two way synchronous remote replication in order to synchronize the content of the volumes 425 a-b to be mirrors of one another.

The foregoing concept of a stretched volume or LUN may be extended to a third data storage system, the data storage system 3 (DS3) 490, that may also be included in the same metro cluster configuration whereby a third volume R3, LUN A* 425 c on the DS3 490 is also configured to have the same identity as the volumes 425 a-b. In this manner, paths from the one or more hosts to the third volume R3 425 c on the DS3 490 are similarly viewed as additional paths to the same stretched volume or LUN. In such an embodiment, the volumes 425 b-c may be configured to have two way synchronous replication of writes in a manner similar to the volumes 425 a-b. In at least one embodiment, processing may be performed to maintain mirrored identical content on the volumes 425 a-c in a synchronous manner whereby writes applied to any one of the volumes 425 a-c may also be applied in a synchronous manner to the remaining ones of the volumes 425 a-c. For example, a write may be received at the system 410 for the stretched volume copy 425 a. The write to the volume 425 a may be synchronously replicated to the system 430 and applied to the volume 425 b, and also synchronously replicated from the system 430 to the system 490 and applied to the volume 425 c.

In at least one embodiment, an acknowledgement may not be returned to the host that sent the originating write to the system 410 until the system 410 receives an acknowledgement, directly or indirectly, that both the systems 430 and 490 have completed the write such as by storing the write data in caches of the systems 430, 490. The example 480 illustrates a daisy-chain like arrangement for the stretched volume configured from the volumes 425 a-c with the same identity. In such an arrangement for synchronous replication, a write from a host may be received at the system 410. In response, the write may be synchronously replicated from the system 410 to the system 430. The system 430 receiving the write may then synchronously replicate the write from the system 430 to the system 490. In response to receiving the write, the system 490 may return a first acknowledgement to the system 430. In response to receiving the first acknowledgement, the system 430 may return a second acknowledgement to the system 410. In response to receiving the second acknowledgement, the system 410 may then return a third acknowledgement to the host regarding completion of the write operation. Receiving this second acknowledgement notifies the system 410 that the write has been successfully replicated and stored in the systems 430 and 490. Other arrangements and configurations of stretched volumes across more than 2 data storage systems are also possible. In such other arrangements and configurations, the original data storage system 410 receiving the host write may only return an acknowledgment to the host regarding completion of the received write once the system 410 receives an acknowledgment, directly or indirectly, that all systems configured in the stretched LUN or volume configuration have received and stored the write in their respective systems.
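The daisy-chain acknowledgement ordering described above may be illustrated by the following minimal sketch. The class name StorageSystem, its fields, and the dictionary used as a cache are hypothetical and chosen only for illustration; they are not taken from any particular embodiment.

```python
class StorageSystem:
    """Hypothetical model of one system in a daisy-chain replication arrangement."""

    def __init__(self, name, next_system=None):
        self.name = name
        self.next_system = next_system  # downstream system in the chain, if any
        self.cache = {}                 # stands in for the system's write cache

    def write(self, address, data):
        # Store the write locally, then replicate downstream synchronously.
        self.cache[address] = data
        if self.next_system is not None:
            # Do not acknowledge until the downstream system acknowledges.
            if not self.next_system.write(address, data):
                return False
        return True  # acknowledgement returned to the caller


# Chain corresponding to the example: 410 -> 430 -> 490.
ds3 = StorageSystem("490")
ds2 = StorageSystem("430", next_system=ds3)
ds1 = StorageSystem("410", next_system=ds2)

# The host receives its acknowledgement only after 490 and 430 have acknowledged.
host_ack = ds1.write(0x10, b"payload")
```

In this sketch the return value of each nested call plays the role of the first, second and third acknowledgements, in that order.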

In such embodiments, the stretched LUN or volume is generally configured from M volume instances on M different data storage systems in a metro cluster configuration, where the M volume instances are configured as identical volumes and recognized by the host as the same volume or LUN, and where M is equal to or greater than 2.

As discussed above, a stretched volume or LUN may be represented and configured as two or more volume or LUN instances located at different physical data storage systems. Generally, the two or more volume instances may be located in the same cluster or in different clusters.

A stretched LUN or volume may be used for any suitable purpose or application. For example, as discussed herein, a stretched volume or LUN may be included in a metro cluster configuration where the remote volume instance is used by a host in case of a disaster or data unavailability of the local copy of the stretched volume. A stretched LUN may also be used, for example, for volume migration between appliances or data storage systems in the same or possibly different clusters.

Generally, the multiple data storage systems including the multiple volume instances configured as the same stretched volume are synchronized in multiple aspects in order to have the multiple volume instances appear to the host as the same stretched volume or LUN. For example, as discussed above, data of the multiple volume instances configured as the same stretched volume or LUN is synchronized. In such implementations with a stretched LUN or volume exposed to the host over multiple paths from multiple data storage systems, the host may issue a read I/O command over any one of the multiple paths to read data from a target logical address of the stretched volume. In response, the host receives the same read data independent of the particular one of the multiple paths over which the host sends the read I/O. In a similar manner, a write I/O command issued over any one of the multiple paths to the stretched volume or LUN results in all multiple volume instances of the stretched volume being updated with the data written by the write I/O command.

In connection with a stretched volume or LUN, volume metadata (MD) also needs to be synchronized between all volume instances configured as the same stretched volume or LUN. The volume MD for a stretched LUN or volume may generally include information that may be reported to a host that requests such information about the stretched volume or LUN. Consistent with the SCSI standard as well as other standards, various management or control path commands may be issued by a host to a data storage system over a path over which the stretched LUN is exposed or visible to the host. In such implementations with a stretched LUN or volume exposed to the host over multiple paths from multiple data storage systems, the host may issue the management command requesting information included in the MD about the stretched volume over any one of the multiple paths. In response, the host receives the requested information. Further, the same set of information regarding the stretched volume is sent to the host independent of the particular one of the multiple paths over which the host sends the management command. Put another way, the host may send the management command requesting information about the stretched volume over any one of the multiple paths exposing the stretched volume, where the same set of information is returned to the host when the same management command is sent over any one of the multiple paths exposing the stretched LUN to the host. Additionally, any changes to the volume MD for the stretched volume made to one copy of the volume MD on one data storage system also need to be synchronized with any other copy of the volume MD stored on another data storage system. Thus, for example, a management command may be sent on a first path to a first data storage system, where the management command updates a first copy of the stretched volume MD of the first data storage system. Subsequently, the changes to the first copy of the stretched volume MD are also sent or applied to a second copy of the stretched volume MD of a second data storage system also exposing a configured instance of the stretched volume.

The volume MD may generally be included in management information stored in a management DB, such as the management (MGT) DBs 411 a-b of FIG. 7A. In a system in accordance with the SCSI standard, the volume MD may include information regarding, for example, volume reservations, ALUA path states for paths over which the stretched volume is exposed to the host, the particular target ports and TPGs over which the stretched LUN is exposed to the host, and other information that may vary with embodiment.

Developing and testing stretched volume synchronization between volume instances is expensive because running a test requires several physical data storage systems to allocate several different instances of the stretched volume. Test scenarios may be time consuming as well as difficult and error-prone to configure. For example, it may take an undesirable amount of time with many steps to configure the multiple copies or instances of the stretched volume at all the data storage systems. Furthermore, such configuration may require the management or control planes of all such systems to appropriately communicate with one another. Additionally, the multiple instances of the stretched volume are configured for data replication as well as any required synchronization of management information, such as the volume MD.

Thus, configuration of a stretched volume across multiple data storage systems may be characterized as time consuming, difficult and potentially error-prone. Additional system resources are also used to maintain the required synchronization, such as for data replication and synchronization of volume MD or management information regarding the stretched volume or LUN. Furthermore, additional configuration may be required to test the various scenarios desired for the stretched volume or LUN.

In connection with configuring and testing stretched volume scenarios and uses, two different physical storage systems may be used, with the drawbacks and complexities noted above. As a variation, the two or more different storage systems including the multiple instances configured as the same stretched volume may be virtualized. For example, the two data storage systems may be running as two virtualized data storage systems in containers or virtual machines. However, this latter virtualized approach requires further logic to manage the containers or virtual machines that further increases the configuration complexity and may further decrease performance.

The foregoing complexities and drawbacks encountered when configuring, running and validating stretched volume scenarios may result in decreased productivity and delays in connection with feature development and testing regarding stretched volume configurations.

Described in the following paragraphs are techniques that may be used to simplify stretched volume development and testing by simulating a stretched volume or LUN configuration. In at least one embodiment, the stretched volume or LUN may be configured from two copies or volume instances including a regular or normal volume, and a shadow volume. The regular and shadow volumes may be included in the same single data storage system operating in a simulation mode to simulate the stretched volume. The regular volume and the shadow volume configured as the simulated stretched volume may be exposed to, and viewed by a host, as a same logical volume over paths from the single data storage system. In such an embodiment, the techniques provide for creating a pair of volumes in the same single data storage system or appliance to represent the local and remote volume instances configured as the same stretched volume or LUN. The pair of volumes on the same system may be linked to each other in connection with a simulation mode.

In at least one embodiment, the target ports of the single data storage system simulating the stretched volume may be partitioned into two groups. If a management command is received at a first of the two groups, the command is assumed to be directed to the regular or normal volume representing the local volume instance of the stretched volume. If a management command is received at a second of the two groups, the command is assumed to be directed to the shadow volume representing the remote volume instance of the stretched volume.

Generally, the data storage system does not allow creating or configuring the identical volumes as used in connection with the stretched volume or LUN. As such, in at least one embodiment, the techniques herein use different unique identifiers (UIDs) as the LUN IDs for the different volume instances of the simulated stretched volume. A simulation mode for simulating the stretched volume in the single system includes a conversion algorithm and component that maps or switches between the different UIDs linked to the same stretched volume. In at least one embodiment in accordance with the SCSI standard, the UIDs may be WWNs, where a first WWN may be used to identify the regular volume and a second different WWN may be used to identify the shadow volume. The data storage system running in simulation mode links the foregoing first and second WWNs together and associates them both with the same stretched volume or LUN. A component, such as a UID switching logic or component, may be used to map the first WWN to the second WWN, and also map the second WWN to the first WWN as may be needed in connection with simulating the stretched volume configuration when processing management commands.
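By way of illustration only, the bidirectional mapping performed by a UID switching component may be sketched as follows. The class name UidSwitch, its methods, and the WWN strings are hypothetical placeholders and do not denote an actual interface or actual identifiers.

```python
class UidSwitch:
    """Hypothetical UID switching component linking a regular and a shadow volume."""

    def __init__(self):
        self._peer = {}  # maps each linked UID to its counterpart

    def link(self, regular_wwn, shadow_wwn):
        # Link both WWNs so that each maps to the other.
        if regular_wwn == shadow_wwn:
            raise ValueError("regular and shadow volumes need different UIDs")
        self._peer[regular_wwn] = shadow_wwn
        self._peer[shadow_wwn] = regular_wwn

    def switch(self, wwn):
        # Return the counterpart UID of the same simulated stretched volume.
        return self._peer[wwn]


# Placeholder WWN values, assumed for illustration.
REGULAR_WWN = "wwn-regular-0001"
SHADOW_WWN = "wwn-shadow-0001"

uid_switch = UidSwitch()
uid_switch.link(REGULAR_WWN, SHADOW_WWN)
shadow_of_regular = uid_switch.switch(REGULAR_WWN)
regular_of_shadow = uid_switch.switch(SHADOW_WWN)
```

Storing the mapping in both directions keeps the lookup symmetric, so that switching twice returns the original UID.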

In at least one embodiment, the first WWN or other LUN ID associated with the regular volume may be exposed to the host as the LUN ID of the simulated stretched volume configured from the regular volume and its associated counterpart shadow volume.

The foregoing and other aspects of the techniques herein are described in more detail in the following paragraphs.

In the following paragraphs, the techniques herein may refer to a SCSI based protocol, such as FC or iSCSI. However, the stretched volume and the techniques described in the following paragraphs may also be used in embodiments using other suitable protocols such as, for example, NVMe.

In the following paragraphs, illustrative examples are provided in which a stretched volume is configured from two volumes: a regular or normal volume and a shadow volume. More generally, the techniques herein may be further extended for use with a stretched volume configured from any suitable number of configured volumes, M, on M different data storage systems or appliances, where M is equal to or greater than 2. Note that this follows from the generalization of a stretched volume or LUN configuration as described elsewhere herein such as, for example, in connection with FIG. 7C. In such embodiments, the stretched volume or LUN may be simulated using the regular or normal volume as described herein along with M−1 shadow volumes in the single data storage system. Also, in such an embodiment, the total number of target ports of the single data storage system may be divided into M groups or partitions, where a different one of the M groups or partitions is associated with a particular one of the M configured volumes for the stretched volume. In this manner, each of the M groups or partitions of target ports may be used to simulate paths from the host to one of the M data storage systems or appliances including one of the M volumes configured for the simulated stretched volume.
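One possible way to divide target ports into M groups and to determine which simulated volume a received command is directed to may be sketched as follows. The round-robin assignment, the function names, and the port names P0 through P5 are assumptions made only for this illustration; any other partitioning may be used.

```python
def partition_ports(ports, m):
    """Divide target ports into m groups (round-robin assignment is an assumption)."""
    if m < 2:
        raise ValueError("a stretched volume needs at least 2 volume instances")
    if m > len(ports):
        raise ValueError("each group needs at least one target port")
    groups = [[] for _ in range(m)]
    for i, port in enumerate(ports):
        groups[i % m].append(port)
    return groups


def volume_index_for_port(groups, port):
    """Return 0 for the regular volume, or 1..m-1 for one of the shadow volumes."""
    for index, group in enumerate(groups):
        if port in group:
            return index
    raise KeyError(port)


# Hypothetical port names on the single simulating system, with M = 3.
groups = partition_ports(["P0", "P1", "P2", "P3", "P4", "P5"], 3)
index_for_p4 = volume_index_for_port(groups, "P4")
```

With M equal to 2, the same functions yield the two-group case in which the first group simulates the local system and the second group simulates the remote system.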

In the following paragraphs, the techniques are described in embodiments in which a particular ALUA path state for a particular volume or LUN is applied at the TPG level of granularity where all target ports in the same TPG have the same ALUA path state. In this case, all target ports in the TPG over which a volume or LUN is exposed acquire the TPG ALUA path state. For example, setting a TPG to active optimized for an exposed LUN accordingly sets all target ports in the TPG to active optimized for the exposed LUN. As another example, setting a TPG to active non optimized for the exposed LUN accordingly sets all target ports in the TPG to active non optimized for the exposed LUN. As a variation as also illustrated herein such as in connection with FIGS. 7A and 7B, ALUA access path states may be set individually on a per path basis rather than at a TPG level. More generally, the techniques herein may be used in connection with embodiments which do not utilize ALUA and do not utilize ALUA path states.
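The TPG level of granularity described above may be sketched as follows, where a path state is recorded once per TPG and per LUN and every target port of the TPG reports that state. The class name TargetPortGroup, the TPG name and the port names are hypothetical.

```python
ACTIVE_OPTIMIZED = "active optimized"
ACTIVE_NON_OPTIMIZED = "active non optimized"


class TargetPortGroup:
    """Hypothetical TPG holding one ALUA path state per exposed LUN."""

    def __init__(self, name, ports):
        self.name = name
        self.ports = list(ports)
        self._state_by_lun = {}

    def set_state(self, lun, state):
        # Setting the TPG state implicitly sets it for all member ports.
        self._state_by_lun[lun] = state

    def port_states(self, lun):
        # Every target port in the TPG acquires the TPG's state for the LUN.
        state = self._state_by_lun[lun]
        return {port: state for port in self.ports}


tpg = TargetPortGroup("TPG1", ["P1", "P2"])
tpg.set_state("LUN A", ACTIVE_OPTIMIZED)
states_before = tpg.port_states("LUN A")
tpg.set_state("LUN A", ACTIVE_NON_OPTIMIZED)
states_after = tpg.port_states("LUN A")
```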

The techniques described in the following paragraphs may be used to simulate a stretched volume or LUN that may be used in any suitable application, some of which are described herein. For example, the techniques described in the following paragraphs may be used to simulate and test a stretched volume configuration that may be used in a metro cluster configuration or a metro configuration. The techniques herein may be used to simulate and test a stretched volume configuration as described in connection with FIGS. 7A and 7B where, for example, the ALUA access path states change to simulate a disaster at a local or primary data storage system. As another example, the techniques herein may be used to simulate and test a stretched volume configuration that may be used in connection with migrating data of the stretched volume from a first instance or volume to a second instance or volume. The migration may be performed, for example, in connection with load balancing to relocate and migrate a volume between a simulated local system and simulated remote system.

Before further describing embodiments of the techniques herein for simulating the stretched volume, presented is an initial discussion of information that may be included in volume MD (in some contexts referred to herein as simply MD) as well as various commands that may be used in connection with querying and modifying the volume MD. Examples in the following paragraphs may be in accordance with a particular protocol and standard, such as the SCSI protocol and standard. However, other suitable protocols and standards, such as NVMe, may be used in connection with the techniques herein, wherein such other protocols and standards may have similar concepts, commands and information included in volume MD.

One example of volume MD includes SCSI reservation and registration information. For example, SCSI-2 and SCSI-3 are versions of the SCSI standard that support device registrations and reservations and have various commands that perform operations affecting device registrations and reservations. For example, SCSI-3 has persistent reservation (PR) commands. Commands used in connection with reservation and registration information may include commands that, for example, perform a registration, read information regarding existing registrations, perform a reservation, perform a clear operation to clear a reservation, perform a release to release a reservation, and perform processing to preempt a reservation.

SCSI PR uses the concepts of registrations and reservations. PRs allow multiple hosts, or more generally multiple initiators, to communicate with a target by tracking multiple initiator-to-target relationships called I_T nexuses. An I_T nexus is a relationship between a specific SCSI initiator port (I) and a specific SCSI target port (T) for a given LUN within the SCSI target. It should be noted that the following examples may refer to SCSI PR commands such as in a SCSI-3 based system. However, similar commands and/or operations may be performed in other embodiments based on other versions of the SCSI standard which also affect reservation state information.

As a first step in setting up a PR, registration may be performed using a Reservation Key, also more generally referred to herein as simply a “key”. A key may generally be any suitable value, such as a numeric value. Each host system that participates registers a key with each volume or LUN over each path (e.g., each initiator (I) and target port (T) pairing) over which each particular volume or LUN is accessible to the host. For example, with reference to FIG. 7A, the stretched LUN A may be exposed to the host 1 412 over the 4 paths: I11-T11 422 a, I12-T13 423 a, I13-T31 423 b, and I14-T33 423 c, where the host 412 may register its key, K1, over each of the foregoing 4 paths to access the stretched LUN A. In a similar manner, the stretched LUN A may be exposed to the host 2 432 over the 4 paths: I31-T12 423 d, I32-T14 423 e, I33-T32 422 b and I34-T34 423 f, where the host 432 may register its key, K2, over each of the foregoing 4 paths to access the stretched LUN A. Although each of the hosts 412, 432 is described for illustration purposes as using a different key, more generally, hosts may use the same or different keys. In such a system where each host registers with a different key over all its own paths to the same LUN, all registrations having the same key may denote all paths from a particular host to the LUN.

As a result of the hosts 412, 432 each registering their respective keys over their respective 4 paths noted above, the data storage system 410 may include the following first set of registration information for the stretched volume or LUN A of Table 1, and the data storage system 430 may include the following second set of registration information for the stretched volume or LUN A of Table 2:

TABLE 1
Registration information for LUN A on the data storage system 410

Volume/LUN   Key   Init ID   Target ID
A            K1    I11       T11
A            K1    I12       T13
A            K2    I31       T12
A            K2    I32       T14

TABLE 2
Registration information for LUN A on the data storage system 430

Volume/LUN   Key   Init ID   Target ID
A            K1    I13       T31
A            K1    I14       T33
A            K2    I33       T32
A            K2    I34       T34

Processing may be performed to synchronize the volume MD for the stretched LUN A where the information of Table 1 is sent from the system 410 to the system 430, whereby the system 430 updates the volume MD for the stretched LUN A to include a combination of the information of the Tables 1 and 2. Processing may be performed to synchronize the volume MD for the stretched LUN A where the information of Table 2 is sent from the system 430 to the system 410, whereby the system 410 updates the volume MD for the stretched LUN A to include a combination of the information of the Tables 1 and 2.

As a result of the volume MD synchronization, the collective registration information for the stretched volume or LUN A as stored in each of the MGT DBs 411 a-b, respectively, of the systems 410 and 430 may include the following information as in the Table 3 below:

TABLE 3

Row #   Volume/LUN   Key   Init ID   Target ID
1       A            K1    I11       T11
2       A            K1    I12       T13
3       A            K1    I13       T31
4       A            K1    I14       T33
5       A            K2    I31       T12
6       A            K2    I32       T14
7       A            K2    I33       T32
8       A            K2    I34       T34
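The synchronization that produces the collective registration information of Table 3 from Tables 1 and 2 may be sketched as a simple merge. The tuple layout (volume, key, initiator, target) and the function name synchronize are assumptions made for illustration; the ordering by key and then by initiator is chosen only to reproduce the row order of Table 3.

```python
# Rows are (volume, key, initiator, target), taken from Tables 1 and 2.
TABLE_1 = [
    ("A", "K1", "I11", "T11"),
    ("A", "K1", "I12", "T13"),
    ("A", "K2", "I31", "T12"),
    ("A", "K2", "I32", "T14"),
]
TABLE_2 = [
    ("A", "K1", "I13", "T31"),
    ("A", "K1", "I14", "T33"),
    ("A", "K2", "I33", "T32"),
    ("A", "K2", "I34", "T34"),
]


def synchronize(local_rows, received_rows):
    """Combine local registrations with those received from the peer system."""
    combined = set(local_rows) | set(received_rows)
    return sorted(combined, key=lambda row: (row[1], row[2]))


# Each system merges its own table with the table sent by the other system.
md_on_410 = synchronize(TABLE_1, TABLE_2)
md_on_430 = synchronize(TABLE_2, TABLE_1)
```

Because the merge is a set union, both systems arrive at the same eight rows regardless of which table each one started with.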

In at least one embodiment, registration of the PR keys as illustrated in the Tables 1, 2 and 3 may be performed by the hosts 412, 432 as part of discovery processing whereby various devices and connections visible or accessible to the hosts are discovered. As part of host discovery processing, each of the hosts may register a key for each LUN accessible to the host over each path over which the LUN is accessible. In an embodiment in which each host uses its own set of one or more keys, a Reservation Key may be registered for each I_T nexus (each I-T over which a LUN is accessible to the initiator I) and includes the necessary information to allow the authentication of the I_T nexus devices in order to control the reservations.

The information in Table 3 may denote the aggregated or collective set of registration information included in the volume MD for the LUN A.

An embodiment in accordance with techniques herein may provide support for the PR IN command to read or query registration and reservation information included in the volume MD of the MGT DBs 411 a-b. It should be noted that an embodiment may include different command parameters with the PR IN command to request and vary the particular information provided and returned to the requesting initiator. For example, a PR IN command may include one or more parameters identifying the particular information to be returned. For example, the PR IN command may include one or more parameters requesting to return a complete set of all registration and reservation information of the databases, return only reservation information, return only registration information (e.g., key information), return only registration and/or reservation information associated with a particular key, and the like. To further illustrate, assume subsequent to issuing the 4 PR registration commands, the host 412 issues a PR IN command over the path I11-T11 422 a to the system 410 requesting a list of all existing or active registrations and reservations with respect to a particular LUN, such as the stretched volume or LUN A. Generally, the PR IN command is directed to a particular LUN and issued over one of the paths (from initiator to a target port) for which there is an existing registration for the LUN. In response to receiving the PR IN command over the path 422 a for the volume or LUN A, the system 410 may query its MGT DB A 411 a for the requested reservation and registration information of the volume MD for the stretched volume or LUN A. In response to the PR IN command, the system 410 may return the information as described above in Table 3. In a similar manner, issuing the same PR IN command regarding LUN A over any of the 8 paths to the systems 410, 430 also results in returning the same information as described in Table 3. It should be noted that if there were also existing reservations (described elsewhere herein) with respect to LUN A, then information regarding such existing reservations may also be returned in response to the PR IN command described above.
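The path-independent behavior of the PR IN query may be sketched as follows. The function pr_in, its optional key parameter, and the rule that target ports named T1x belong to the system 410 and ports named T3x belong to the system 430 are simplifications assumed for this illustration and do not reflect the actual command format.

```python
# Synchronized registration rows (volume, key, initiator, target), as in Table 3.
TABLE_3 = [
    ("A", "K1", "I11", "T11"), ("A", "K1", "I12", "T13"),
    ("A", "K1", "I13", "T31"), ("A", "K1", "I14", "T33"),
    ("A", "K2", "I31", "T12"), ("A", "K2", "I32", "T14"),
    ("A", "K2", "I33", "T32"), ("A", "K2", "I34", "T34"),
]

# After synchronization, each system holds its own copy of the same volume MD.
MD_ON_410 = list(TABLE_3)
MD_ON_430 = list(TABLE_3)


def pr_in(target_port, lun, key=None):
    """Return registrations for a LUN, optionally restricted to a single key."""
    # Assumed naming rule: T1x ports are on system 410, T3x ports on system 430.
    volume_md = MD_ON_410 if target_port.startswith("T1") else MD_ON_430
    return [
        row for row in volume_md
        if row[0] == lun and (key is None or row[1] == key)
    ]


via_local_path = pr_in("T11", "A")
via_remote_path = pr_in("T33", "A")
only_k2 = pr_in("T31", "A", key="K2")
```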

In this manner, the requesting host 412 or initiator I11 may be presented with a complete view of registration and reservation information with respect to all paths to the stretched volume or LUN A across both systems 410, 430 by issuing the PR IN command directed to LUN A over any of the 8 active paths exposing the stretched LUN A, and may behave as if the 8 active paths to the stretched volume or LUN A are all on the same data storage system. This is consistent with discussion elsewhere herein where the host 412 has a view that the paths 422 a, 423 a, 423 b and 423 c are 4 active paths to the same volume or LUN A, and where the host 432 has a view that the paths 423 d, 423 e, 422 b and 423 f are 4 active paths to the same volume or LUN A even though there are both primary and secondary copies 425 a-b of the stretched volume or LUN A configured in the metro cluster configuration on the two different data storage systems 410, 430.

Commands affecting or related to registrations and reservations, such as various ones of the PR commands, affect the ability of initiators and thus hosts to perform I/O with respect to different LUNs. For example, in connection with registrations with the SCSI standard, if there is no registration with respect to a particular I-T nexus (e.g., initiator and target port for a particular LUN), that initiator may at most be able to have read-only access to data of that LUN over the path from the initiator to the target port. As described below in more detail, an initiator may also issue other commands, such as a reservation command, which request a particular type of volume or LUN access and may block or modify access allowed by other initiators and hosts. Such other commands described in more detail in the following paragraphs may result in modifying or updating existing volume MD, such as for the stretched LUN A, whereby such modifications may also be synchronized among the systems 410, 430 of the metro cluster configuration hosting copies 425 a-b of the stretched LUN A.

In at least one embodiment in accordance with the SCSI standard, a PR reserve or reservation command may be issued over a path from a particular initiator to a particular target port and directed to a LUN (e.g., a PR reservation may be made with respect to a particular LUN, initiator and target port). Additionally, the PR reserve or reservation command may include parameters such as, for example, a parameter that denotes a key of a previous PR registration, a parameter identifying an associated type of I/O access for the requested reservation, and possibly other parameters. For example, the type of I/O access parameter may be one of a variety of different types of I/O access such as exclusive access (whereby no other initiator besides the current reservation holder having the exclusive access is allowed to issue any I/Os to the LUN), write exclusive access (whereby only the initiator holding the current reservation is allowed to issue writes but other initiators may issue read I/Os), and the like. In at least one embodiment in accordance with the SCSI standard, the PR reservation command may be included in the broader category of PR OUT commands that generally change or modify volume MD associated with a particular volume or LUN.

To further illustrate, assume that the initiator I11 of the host 412 issues a PR reservation command for the stretched LUN A over the path I11-T11 422 a, where the PR reservation command requests write exclusive access so that only the initiator I11 holding the current reservation is allowed to issue writes but other initiators may issue read I/Os. In response to receiving the foregoing PR reservation command, the system 410 may update the volume MD for the stretched LUN A as included in the MGT DB A 411 a to also include an existing reservation for I11 for write exclusive access. Additionally, processing may be performed to synchronize the volume MD for the LUN A of the MGT DB 411 a of the system 410 with corresponding volume MD for the LUN A in the MGT DB 411 b of the system 430. For example, the system 410 may send the reservation for I11 for write exclusive access for LUN A to the system 430, whereby the system 430 may accordingly update its local copy of the volume MD for the LUN A in the MGT DB 411 b. Subsequently, an acknowledgement or response may be returned from the system 430 to the system 410, and then from the system 410 to the initiator I11 of the host 412, where the acknowledgement or response indicates successful completion of the PR reservation command requesting a reservation for I11 for write exclusive access to the LUN A.

Thus, in this manner, reservations from the data storage system 410 (receiving the PR reservation command) may be mirrored on the remote data storage system 430 in an atomic manner. Receiving a reservation on one path over which LUN A is accessible through a first data storage system results in replicating the reservation state across all paths over which LUN A is accessible through a second data storage system. Once the reservation for write exclusive access for I11 to the LUN A has completed as described above, a subsequent write I/O, such as from the host 432 over any of the paths 423 d, 423 e, 422 b and 423 f, may result in an error due to the existing reservation for the LUN A for the initiator I11.
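The mirroring of a write exclusive reservation and its effect on subsequent writes may be sketched as follows. The class name System, its methods, and the string used for the access type are hypothetical; the sketch models only the reservation holder check and omits registrations, keys and the atomicity mechanism.

```python
class System:
    """Hypothetical data storage system holding reservation state per LUN."""

    def __init__(self, name):
        self.name = name
        self.peer = None
        self.reservations = {}  # lun -> (holder initiator, access type)

    def pr_reserve(self, lun, initiator, access_type, from_peer=False):
        # Update the local volume MD, then mirror to the peer before acknowledging.
        self.reservations[lun] = (initiator, access_type)
        if not from_peer and self.peer is not None:
            if not self.peer.pr_reserve(lun, initiator, access_type, from_peer=True):
                return False
        return True

    def write_allowed(self, lun, initiator):
        # With no reservation, writes are not restricted in this sketch.
        reservation = self.reservations.get(lun)
        if reservation is None:
            return True
        holder, _access_type = reservation
        return holder == initiator


system_410 = System("410")
system_430 = System("430")
system_410.peer = system_430
system_430.peer = system_410

# I11 reserves LUN A over a path to the system 410.
ack = system_410.pr_reserve("A", "I11", "write exclusive")
# A later write from I33 over a path to the system 430 is rejected.
i33_may_write = system_430.write_allowed("A", "I33")
i11_may_write = system_410.write_allowed("A", "I11")
```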

In at least one embodiment in accordance with the SCSI standard, other commands that may modify volume MD may include a clear command that is a sub-command of the PR OUT command and may be issued to a particular LUN to release or clear the persistent reservation (if any) and clear registrations for the particular LUN. In a similar manner as discussed herein in connection with other management commands that modify the volume MD of the stretched volume or LUN A, any reservations and registrations of the LUN A cleared on one of the systems 410, 430 (e.g., receiving the PR clear command) may be mirrored on the other remote one of the systems 410, 430 in order to synchronize the volume MD for the LUN A across both systems 410, 430. Generally, depending on the particular embodiment, other parameters and criteria may be specified in connection with the clear command that affect the particular registrations and/or reservations cleared or removed for the LUN A.

In at least one embodiment in accordance with the SCSI standard, other commands that may modify volume MD may include a release command that releases any active persistent reservation but does not remove the registrations for a particular LUN. In connection with the SCSI-3 standard, the release command is a sub-command of the PR OUT command and is issued to a particular LUN to release or clear the persistent reservation (if any) from the LUN. In a similar manner as discussed herein in connection with other management commands that modify the volume MD of the stretched volume or LUN A, any reservations of the LUN A released on one of the systems 410, 430 (e.g., receiving the PR release command) may be mirrored on the other remote one of the systems 410, 430 in order to synchronize the volume MD for the LUN A across both systems 410, 430. Generally, depending on the particular embodiment, other parameters and criteria may be specified in connection with the release command that affect the particular registrations and/or reservations cleared or removed for the LUN A.
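The difference between the clear and release commands with respect to the volume MD may be sketched as follows. The dictionary layout of the volume MD and the function names are assumptions made for illustration only.

```python
def new_volume_md():
    """Build hypothetical volume MD with two registrations and one reservation."""
    return {
        "registrations": {("K1", "I11", "T11"), ("K2", "I31", "T12")},
        "reservation": ("I11", "write exclusive"),
    }


def pr_release(volume_md):
    # Release removes the reservation but keeps the registrations.
    volume_md["reservation"] = None


def pr_clear(volume_md):
    # Clear removes both the reservation and the registrations.
    volume_md["reservation"] = None
    volume_md["registrations"].clear()


def mirror(source_md, peer_md):
    # Apply the resulting state to the peer system's copy of the volume MD.
    peer_md["registrations"] = set(source_md["registrations"])
    peer_md["reservation"] = source_md["reservation"]


local_md, remote_md = new_volume_md(), new_volume_md()
pr_release(local_md)
mirror(local_md, remote_md)
registrations_after_release = len(remote_md["registrations"])

pr_clear(local_md)
mirror(local_md, remote_md)
registrations_after_clear = len(remote_md["registrations"])
```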

The foregoing are examples of some management commands in connection with the SCSI protocol and standard that may be used in connection with the stretched volume or LUN, such as the LUN A. More generally, other management commands may be supported and the particular examples provided herein are illustrative and not meant to be limiting.

As a further example of volume MD such as for the stretched LUN A, consider an embodiment in accordance with the ALUA standard utilizing the ALUA path states or access path states as described herein. The particular ALUA path states with respect to a particular volume or LUN may be included in volume MD for the volume or LUN. For example, the information in the tables 310 and 320 of FIG. 6 may be included in the volume MD for the particular exposed LUNs at different points in time. As another example, the particular path states as described in connection with the FIGS. 7A and 7B may be included in the volume MD for the stretched LUN A at different points in time. In connection with the stretched LUN A of FIGS. 7A and 7B, the path state information regarding the paths over which LUN A is exposed may be synchronized across the systems 410 and 430. For example, any path state changes made by the system 410 regarding a path including any of the target ports T11-T14 may be communicated to the system 430 so that both the systems 410 and 430 may have copies of the same set of ALUA path state information for the 8 paths over which the stretched LUN A is exposed to the hosts 412, 432.

In at least one embodiment in accordance with the SCSI standard, management commands such as a report target port group (RTPG) command and an inquiry command may be issued by any of the hosts 412, 432 to return information regarding a particular LUN, such as the stretched LUN A configured from the volumes 425 a-b in the metro cluster configuration. In at least one embodiment, commands such as the foregoing issued by the hosts 412, 432 to the data storage systems 410, 430 may result in reporting information about the requested LUN A. The information returned and reported may identify the existing paths and associated ALUA path states, TPGs and target ports over which the LUN A (e.g., volumes 425 a-b) is exposed to the hosts 412, 432. The information returned in response to the RTPG command is generally described elsewhere herein, for example, such as in connection with the FIGS. 7A and 7B above. In at least one embodiment, the information reported or returned in response to such commands may omit any TPG and target ports for which there is no path (e.g., unavailable status) to the LUN A. The RTPG command requesting information about LUN A may be issued to the systems 410, 430 over any of the paths over which the LUN A is exposed to the hosts 412, 432. In response, the systems 410, 430 return the same set of collective information in response to an RTPG command for LUN A issued over any of the paths exposing LUN A. The set of collective information returned over any of the paths exposing LUN A may include the target ports exposing LUN A, target port groups exposing LUN A, and ALUA path states for paths exposing LUN A.

The foregoing generally describes some of the management commands that may be issued in connection with a stretched LUN or volume configuration. In at least one embodiment in accordance with the techniques herein, the foregoing management commands as well as others may also be issued and processed using the simulated stretched volume or LUN. In this manner, the simulated stretched volume or LUN described in more detail below may be used in testing and development of the management commands.

In at least one embodiment in accordance with the techniques herein, the simulated stretched volume or LUN may be used in testing and development of volume MD synchronization for management commands.

In at least one embodiment, stretched volume development and testing may be performed using a simulated stretched volume or LUN created from a pair of volume objects in the same data storage system. The pair of volume objects may represent, respectively, the local and remote portions of the stretched volume as two LUNs or volumes that are linked to one another in simulation mode.

In at least one embodiment, one of the volumes of the pair may be selected for processing a management command depending on the particular target port selected by a host initiating the management command. In at least one embodiment, the management command may be any of the management commands described herein as well as others that may be supported in connection with the simulated stretched volume.

In at least one embodiment, the MGT DB of the single data storage system may store volume MD for each volume or LUN of the system. In connection with the simulated stretched volume represented by the pair of volumes, the MGT DB may include an individual record or entry for each of the volumes of the pair. A first volume V1 of the pair may be a normal or regular volume corresponding to the local copy of the stretched volume on the local data storage system. A second volume V2 of the pair may be a shadow volume corresponding to the remote copy of the stretched volume on the remote data storage system. The MGT DB may store a first entry or record for V1, and a separate second entry or record for V2, thereby separating the local and remote metadata of the stretched volume.

In at least one embodiment, for each simulated stretched volume, a normal or regular volume V1 may be created having an identity or LUN ID, such as a WWN in accordance with the SCSI protocol, that is exposed to the external host(s). Additionally, for each simulated stretched volume, a shadow volume V2 is also created. The different copies of the volume MD for V1 and V2 may be maintained using the separate entries or records of the MGT DB associated with each of the volumes V1 and V2. The host may view both the regular volume V1 exposed over a first path and the shadow volume V2 exposed over a second path as the same LUN or volume having the same identity, such as the same WWN.

In at least one embodiment, V1 may have a first identifier, such as a first WWN, used to uniquely identify V1 with respect to all other volumes or LUNs. V2 may have a second identifier, such as a second WWN, used to uniquely identify V2 with respect to all other volumes or LUNs. In at least one embodiment, the first identifier may be used as an index into a table of the MGT DB to obtain a first entry or record including volume MD for V1, and the second identifier may be used as an index into a table of the MGT DB to obtain a second entry or record including volume MD for V2. In at least one embodiment, the contents of the first and second entries or records may be the same other than the different LUN IDs, such as the different WWNs, assigned to V1 and V2. In at least one embodiment in accordance with the SCSI standard, the host may view the regular volume and the shadow volume as having the same identity, such as having the same first identifier such as the first WWN.

In at least one embodiment, the target ports of the single data storage system may be partitioned into two logical groups for simulating paths, respectively, to the local and remote data storage systems, where the management commands may be issued over such paths.

In at least one embodiment, a UID switching logic or component (sometimes referred to herein as a UID switch) may be used to map between the different LUN IDs, such as the different WWNs, associated with the normal or regular and shadow volumes of the stretched volume. In at least one embodiment as noted above, the regular volume may have the first WWN and the shadow volume may have the second WWN different from the first WWN. The regular volume exposed over a first path and the shadow volume exposed over a second path may both be presented to the host as the same stretched logical volume having the same identity, such as, for example, where both the regular and shadow volumes are presented with the first WWN. When the host issues a first management command over the first path to a first target port in a first partition of target ports of a data storage system where the first management command is directed to the simulated stretched volume with the first WWN, the data storage system simulates the local data storage system servicing the first management command. When the host issues a second management command over the second path to a second target port in a second partition of target ports of the data storage system where the second management command is directed to the simulated stretched volume with the first WWN, the data storage system simulates the remote data storage system servicing the second management command.

In at least one embodiment, the regular or normal volume of the pair of volumes used to simulate the stretched volume may be used for servicing I/O commands, such as read and write commands that, respectively, read data from and write data to the stretched volume.

In at least one embodiment, the normal or regular volume V1 simulates the stretched volume copy in the local data storage system, and the shadow volume V2 simulates the stretched volume copy in the remote data storage system.

Referring to FIG. 8, shown is an example 500 of the components and data flow in at least one embodiment in accordance with the techniques herein. The example 500 includes a single data storage system 530 and a host 540. The system 530 may be, for example, a dual node appliance or data storage system described elsewhere herein such as illustrated in FIG. 2. The host 540 may be a host including other components as described herein.

The example 500 illustrates an example of a simulated stretched volume or LUN configured using the regular volume V1 541 and the shadow volume V2 542. As discussed elsewhere herein, the volume V1 represents the copy of the stretched volume on the local data storage system, and the volume V2 represents the copy of the stretched volume on the remote data storage system.

The example 500 includes a BE data store 514 corresponding to the BE PDs of the data storage system 530. Consistent with other discussion herein, the BE data store 514 may include the PDs used to provide BE non-volatile storage for the volumes or LUNs of the system 530. The MGT DB 516 may include information used in connection with the management commands. The MGT DB 516 may include, for example, data storage system configuration information and volume MD for the volumes or LUNs of the system 530.

In this example 500, the system 530 is illustrated as including only the TPGs 510, 512, where the TPG 510 includes the target ports T1 510 a and T2 510 b, and where the TPG 512 includes the target ports T3 512 a and T4 512 b. In this example, the TPG 510 may denote the target ports of the node A of the system 530, and the TPG 512 may denote the target ports of the node B of the system 530. More generally, the system 530 may include any suitable number of TPGs each with any suitable number of target ports. The host 540 includes the initiators I1 and I2. More generally, the host 540 may include any suitable number of initiators. In the example 500, there are the following 4 paths from the host 540 to the system 530: the path I1-T1 532 a, the path I2-T2 532 b, the path I1-T3 534 a, and the path I2-T4 534 b.

The element 531 denotes a key of various data flow arrows or paths within the system 530, where the dashed line paths 502 a-c denote portions of the processing flow in connection with the I/O or data path 530 a. The dotted line paths 504 a-j denote portions of the processing flow in connection with the management or control path 530 b when processing a management command.

Collectively, the target ports of the TPGs 510 and 512 may denote the partitioning of all the target ports of the system 530 into two logical groups. The TPG 510 may represent the simulated target ports of the local data storage system. The paths 532 a-b to the TPG 510 may denote the simulated paths 532 to the local data storage system. The TPG 512 may represent the simulated target ports of the remote data storage system. The paths 534 a-b to the TPG 512 may denote the simulated paths 534 to the remote data storage system.

The UID switches 520, 522 may denote switching logic or components that map between the various UIDs of the volumes linked to the same simulated stretched volume. For example, consistent with discussion above, the UID switches 520, 522 may take as an input a first UID and map the first UID to a second UID, where the first UID and the second UID may be UIDs or unique LUN IDs, such as WWNs, associated with V1 and V2. In this particular example, the UID switches 520, 522 generally toggle between the two different UIDs. In other words, if a UID switch instance is provided the WWN1 or UID of V1, then the UID switch instance outputs the WWN2 or UID of V2. Similarly, if a UID switch instance is provided the WWN2 or UID of V2, then the UID switch instance outputs the WWN1 or UID of V1.
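The toggling behavior of a UID switch instance can be sketched as a two-entry lookup. The class name `UIDSwitch` and its `map` method are hypothetical names chosen for illustration; the UID values 15 and 16 stand in for a WWN1 of V1 and a WWN2 of V2.

```python
class UIDSwitch:
    """Toggles between the two UIDs linked to one simulated stretched volume."""

    def __init__(self, regular_uid: int, shadow_uid: int) -> None:
        if regular_uid == shadow_uid:
            raise ValueError("regular and shadow volumes need distinct UIDs")
        # Each UID maps to the UID of its counterpart volume.
        self._peer = {regular_uid: shadow_uid, shadow_uid: regular_uid}

    def map(self, uid: int) -> int:
        """Return the counterpart UID; raises KeyError for an unlinked UID."""
        return self._peer[uid]


switch = UIDSwitch(regular_uid=15, shadow_uid=16)
```

Because the mapping is its own inverse, applying the switch twice returns the original UID, so the same logic can serve both mapping directions.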

The elements 524, 530 and 526 generally represent a simulated connection between the simulated local and remote storage systems. Consistent with other discussion herein, stretched volume logic assumes a communication connection between the local and remote systems. The connection may be used in a metro cluster configuration to also synchronize management data such as the volume MD. In this manner when in non-simulation mode, the connection may be used to synchronize volume MD and possibly other management information maintained on the local and remote systems for a stretched volume. In simulation mode, the connection between the local and remote systems may be configured as illustrated in FIG. 8 as a loopback to the single data storage system 530 to simulate the connection between the local and remote systems. In other words, in simulation mode, the source system and the destination or target system of the connection are the same single data storage system 530. The connection denoted by 524, 530 and 526 may be used to simulate synchronizing volume MD, for example, where a management command modifies volume MD of the stretched volume. In one aspect as discussed in more detail below, the connection denoted by 524, 530 and 526 may be used to simulate replicating the management command between local and remote data storage systems.

In at least one embodiment, a volume creation operation may be performed to create a simulated stretched volume. Creating the simulated stretched volume in this example creates a regular volume object and additionally creates a shadow volume object. The regular volume object represents the volume and its metadata if the system 530 simulates (e.g., plays the role of) the local data storage system for management command processing. The shadow volume object represents the remote copy of the same volume and its metadata if the system 530 simulates (e.g., plays the role of) the remote data storage system for management command processing. The element 541 denotes the regular volume V1, and the element 542 denotes the shadow volume V2.

In at least one embodiment, the storage system 530 may not allow creating two volumes, such as the regular volume V1 and the shadow volume V2, with the same UID, such as the same WWN. For this reason, the regular volume V1 may be configured with a first UID such as the first WWN, and the shadow volume may be configured with a different second UID, such as the second WWN. For purposes of illustration, assume that the regular volume V1 is configured with a WWN1=15 and the shadow volume V2 is configured with a WWN2=16. The simulation mode implements a conversion algorithm in the UID switches 520, 522, to map or switch from the normal volume UID to the shadow volume UID, and also to map or switch from the shadow volume UID to the normal volume UID. The use of the UID switches facilitates mapping between the different UIDs of the shadow and regular volumes. For example, as discussed in more detail below, a host may provide the UID such as the WWN1=15 in a management command and the UID switch may be used to map the WWN1=15 to WWN2=16, the remote counterpart shadow volume's UID.

In at least one embodiment, the targets representing a volume connection to a host may be partitioned into 2 groups. For example, the targets may be the target ports of the system 530. If the host issues a management command (e.g., a SCSI or NVMe reservation command, a LUN reset command, or a target reset command) to a target from the first group 510, then the command is treated as a command to a first, local data storage system and is processed by the normal volume (i.e., the normal volume object). If the host uses a target from the second group 512, then the command is treated as a command to a second, remote data storage system and is processed by the shadow volume (i.e., the shadow volume object).
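The selection of the volume object by target group can be sketched as follows. The names `Role`, `PORT_ROLE` and `volume_uid_for` are hypothetical, and the assignment of T1 and T2 to the local role and of T3 and T4 to the remote role follows the example of FIG. 8 rather than any required layout.

```python
from enum import Enum


class Role(Enum):
    LOCAL = "local"    # simulated local system; regular volume object
    REMOTE = "remote"  # simulated remote system; shadow volume object


# Assumed partitioning of the target ports into the two groups 510 and 512.
PORT_ROLE = {"T1": Role.LOCAL, "T2": Role.LOCAL,
             "T3": Role.REMOTE, "T4": Role.REMOTE}


def volume_uid_for(target_port: str, regular_uid: int, shadow_uid: int) -> int:
    """Pick the UID of the volume object that processes a management command
    received at target_port."""
    role = PORT_ROLE[target_port]
    return regular_uid if role is Role.LOCAL else shadow_uid
```

For instance, with a regular volume UID of 15 and a shadow volume UID of 16, a command received at T2 would be processed by the object with UID 15 and a command received at T4 by the object with UID 16.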

The stretched volume logic assumes communication between local and remote systems to synchronize volume MD between the systems. In simulation mode, the local to remote connection of the system 530 is configured as a connection or loopback to the system 530, where the source and destination of the connection are both on the system 530. Using the loopback connection when in the simulation mode, any metadata synchronization command sent over the connection is received by the same system 530. The UID switch logic selects the normal or shadow volume object to process the management command depending on the role (e.g., local or remote) when processing the command.

In at least one embodiment, all I/O commands of the data or I/O path may be processed by the regular volume independently of the particular target used for the command. This avoids synchronization or replication of data between the normal and shadow volumes since the same backend data storage of the regular volume 541 is used for servicing I/O commands when the system 530 simulates both the local and remote system roles. In such an embodiment, simulator mode overhead may be minimized resulting in increased performance.

Continuing with the example above, assume that the regular volume V1 541 is configured with a first UID of WWN1=15, and the shadow volume V2 542 is configured with a second UID WWN2=16. In at least one embodiment, the volumes V1 and V2 may be presented and exposed to the host 540 over the paths 532 a-b and 534 a-b as the same stretched volume or LUN having the same WWN1=15. The host 540 may issue management commands over any of the paths 532 a-b, 534 a-b using the WWN1=15. For example, the host 540 may issue a management command such as an RTPG command over any of the paths 532 a-b, 534 a-b requesting information for the stretched volume with the UID or WWN1=15 and receive the same set of information in return.

In at least one embodiment, the MGT DB 516 may include a table or other structure of information regarding existing volumes or LUNs indexed by the UIDs, such as the WWNs, of the volumes. For example, reference is made to the example 600 of FIG. 9A where the MGT DB 516 may include MGT DB records 602, comprising a first entry or record of MD 602 a for the regular volume V1 that is indexed by the WWN1=15 identifying the regular volume, and a second entry or record of MD 602 b for the shadow volume V2 that is indexed by the WWN2=16 identifying the shadow volume.

The table 610 illustrates in more detail information that may be included in the record 602 a for the regular volume where the information in 610 may be reported in response to the RTPG command for the simulated stretched volume configuration of FIG. 8. In particular, as discussed in more detail below in connection with FIG. 8, the information of 610 may be used and returned in response to issuing the RTPG command directed to the UID for the WWN1=15 for the simulated stretched volume over either of the paths 532 a-b simulating paths to the local data storage system. The table 610 includes a row 612 identifying the TPGs of the system 530, a row 614 identifying the target ports of the system 530 and a row 616 identifying the ALUA path states. In particular, the column 610 a indicates that the target ports T1 and T2 are included in the TPG1 and have the active optimized path state; and the column 610 b indicates that the target ports T3 and T4 are included in the TPG2 and have the active non-optimized path state.

The table 620 illustrates in more detail information that may be included in the record 602 b for the shadow volume where the information in 620 may be reported in response to the RTPG command for the simulated stretched volume configuration of FIG. 8. In particular, as discussed in more detail below in connection with FIG. 8, the information of 620 may be used and returned in response to issuing the RTPG command directed to the UID for the WWN1=15 for the simulated stretched volume over either of the paths 534 a-b simulating the paths to the remote data storage system. The table 620 includes a row 622 identifying the TPGs of the system 530, a row 624 identifying the target ports of the system 530 and a row 626 identifying the ALUA path states. In particular, the column 620 a indicates that the target ports T1 and T2 are included in the TPG1 and have the active optimized path state; and the column 620 b indicates that the target ports T3 and T4 are included in the TPG2 and have the active non-optimized path state.

Referring back to FIG. 8, assume a first RTPG command with UID for the WWN1=15 for the simulated stretched volume is issued over the path 532 a from the host 540 to the system 530. In this case, the target port of the path 532 a is T1 510 a that is included in the TPG1 510 associated with simulating paths to the local data storage system. The management command, RTPG, reads or queries information about the requested volume. Since the first RTPG command is received at T1, the RTPG command is processed 504 a by the regular volume object V1 541 where the RTPG command is then serviced by querying 504 b the MGT DB 516 for the requested information from the record 602 a of volume MD for the UID WWN1=15. The requested information is then returned along the return path 504 b, 541, 504 a, and 532 a to the host 540. The requested information returned in response to the first RTPG command may be the information in the table 610.

Assume a second RTPG command with UID for the WWN1=15 for the simulated stretched volume is issued over the path 534 a from the host 540 to the system 530. In this case, the target port of the path 534 a is T3 512 a that is included in the TPG2 512 associated with simulating paths to the remote data storage system. The management command, RTPG, reads or queries information about the requested volume. Since the second RTPG command is received at T3, processing of the command flows (504 e) from the TPG2 512 to the UID switch 520 that maps the UID of the regular volume, WWN1=15, to the UID of the shadow volume, WWN2=16. In this manner, the second RTPG command is processed 504 f by the shadow volume object V2 542, where the RTPG command is then serviced by querying 504 g the MGT DB 516 for the requested information from the record 602 b of volume MD for the UID WWN2=16. The requested information is then returned along the return path 504 g, 542, 504 f, 520, 504 e, and 534 a to the host 540. The requested information returned in response to the second RTPG command may be the information in the table 620.
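The two RTPG flows just described can be sketched with a dictionary standing in for the MGT DB 516. All names here (`MGT_DB`, `ALUA_INFO`, `service_rtpg`) are hypothetical, and the record layout is an assumption made only to mirror the rows of the tables 610 and 620.

```python
import copy

REGULAR_WWN, SHADOW_WWN = 15, 16
REMOTE_PORTS = {"T3", "T4"}  # assumed to simulate paths to the remote system

# Assumed ALUA content corresponding to the tables 610 and 620.
ALUA_INFO = {
    "TPG1": {"ports": ["T1", "T2"], "state": "active optimized"},
    "TPG2": {"ports": ["T3", "T4"], "state": "active non-optimized"},
}

# One record per volume object, indexed by that object's own WWN.
MGT_DB = {
    REGULAR_WWN: {"alua": copy.deepcopy(ALUA_INFO)},  # record 602 a
    SHADOW_WWN: {"alua": copy.deepcopy(ALUA_INFO)},   # record 602 b
}


def service_rtpg(target_port: str, host_wwn: int):
    """Return (WWN of the record consulted, ALUA information reported)."""
    if host_wwn != REGULAR_WWN:
        raise KeyError("the host only knows the stretched volume as WWN1")
    # Commands arriving at a 'remote' port are switched to the shadow UID.
    wwn = SHADOW_WWN if target_port in REMOTE_PORTS else host_wwn
    return wwn, MGT_DB[wwn]["alua"]


first = service_rtpg("T1", 15)   # path 532 a, serviced from the record for WWN 15
second = service_rtpg("T3", 15)  # path 534 a, serviced from the record for WWN 16
```

Although the two commands consult different records, the reported information is equal in this sketch, consistent with the host receiving the same set of information over any path.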

In connection with the above examples of FIG. 8, the simulated stretched volume may be presented to the host 540 as having a UID or WWN1=15 of the regular volume 541. Thus, the host may issue both management commands and I/O commands over the paths 532 a-b, 534 a-b directed to the stretched volume with the WWN of 15. If the host 540 issues requests or commands directed to the volume with the WWN=15 over paths to the target ports of TPG1 510, then the system 530 treats the commands or requests as if issued to the local data storage system and the regular volume V1. Otherwise, if the host issues requests or commands directed to the volume with the WWN=15 over paths to the target ports of TPG2 512, then the system 530 treats the commands or requests as if issued to the remote data storage system and the shadow volume V2. In simulation mode, the system 530 provides for mapping the regular volume's WWN=15 to the shadow volume's WWN=16 when the commands or requests are received at a target port of the TPG2 512. In particular, the UID switch 520 provides for mapping or substituting the received WWN of 15 for the regular volume with the WWN=16 of the shadow volume.

As another example, consider a management command that modifies or updates the volume MD information of the simulated stretched volume in the configuration of FIG. 8. For example, assume that the host 540 registers its key K1 by issuing 4 reservation commands to the system 530 over all 4 paths 532 a-b and 534 a-b. Assume that there are no registrations or reservations prior to issuing the foregoing 4 reservation commands. Reference is made to FIG. 9B illustrating the MGT DB records 652 of the MGT DB 516. The MGT DB records 652 include a first entry or record of MD 652 a for the regular volume V1 that is indexed by the WWN1=15 identifying the regular volume, and a second entry or record of MD 652 b for the shadow volume V2 that is indexed by the WWN2=16 identifying the shadow volume. At the start of this example, assume that the records 652 a-b do not include the information denoted by the tables 660-670.

Subsequently, at a first point in time T1, the host 540 may issue a first registration command, REG1, to register its key K1 on the path 532 a for the stretched volume with the UID WWN1=15. In this case, the target port of the path 532 a is T1 510 a that is included in the TPG1 510 associated with simulating paths to the local data storage system. The management command, REG1, modifies information about the requested volume. Since the REG1 command is received at the target port T1, the REG1 command is processed 504 a by the regular volume object V1 541 where the REG1 command is then serviced by updating 504 b the record 652 a (FIG. 9B) of volume MD for the UID WWN1=15. In particular, the table 660 of FIG. 9B may be updated to register the path I1-T1 with the key K1 as denoted by the element 661 a of the row 661.

Additionally, processing may be performed to simulate synchronizing the volume MD for the remote counterpart of the stretched volume on the remote data storage system. In particular, processing may be performed to simulate replicating the REG1 command with respect to the shadow volume on the remote data storage system. As noted above, in simulation mode the connection is configured as a loopback connection from the system 530 to itself (e.g., the source and destination of the connection are both the same system 530). The management command to synchronize or update the remote copy of the volume MD is issued (504 c) to the local to remote connection client 524 over the connection 530 to the local to remote connection server 526, where both 524 and 526 are in the same system 530. The command REG1 currently specifies the UID WWN1=15 of the normal volume 541. As denoted by the processing arrow 504 j, processing proceeds from the connection server 526 to the UID switch 522, where the UID of the command REG1 is mapped from the WWN1=15 of the regular volume to the UID WWN2=16 of the shadow volume. As denoted by the processing arrow 504 h, control proceeds from the UID switch 522 where the REG1 command with the converted UID WWN2=16 is processed using the shadow volume object 542. As denoted by the processing arrow 504 g, the WWN2=16 of the shadow volume is used as an index to identify the record 652 b to be updated to include the registration information 671 a of the row 671 of the table 670. Processing of the management path then provides for returning along the return path 504 g, 542, 504 h, 522, 504 j, 526, 530, 524, 504 c, 541, 504 a, and 532 a to the host 540.

In a manner similar to that as described above for processing the first registration command REG1, processing may also be performed at a second subsequent point in time T2 to process a second registration command, REG2, issued by the host 540 to register its key K1 on the path 532 b for the stretched volume with the UID WWN1=15. As a result of servicing the management command REG2, the registration information in the row 662 of the table 660 of FIG. 9B may be populated, and the registration information in the row 672 of the table 670 of FIG. 9B may be populated.

At a third point in time T3 subsequent to T2, the host 540 may issue a third registration command, REG3, to register its key K1 on the path 534 a for the stretched volume with the UID WWN1=15. In this case, the target port of the path 534 a is T3 512 a that is included in the TPG2 512 associated with simulating paths to the remote data storage system. The management command, REG3, modifies information about the requested volume. Since the REG3 command is received at the target port T3, command processing flows 504 e to the UID switch 520 that maps the WWN1=15 of the regular volume to the WWN2=16 of the shadow volume. As denoted by the processing arrow 504 f, processing flows from the UID switch 520 to the shadow volume object 542 where the REG3 command is then serviced by updating 504 g the record 652 b (FIG. 9B) of volume MD for the UID WWN2=16. In particular, the table 670 of FIG. 9B may be updated to register the path I1-T3 with the key K1 as denoted by the row 673.

Additionally, processing may be performed to simulate synchronizing the volume MD for the regular volume on the local data storage system. In particular, processing may be performed to simulate replicating the REG3 command with respect to the regular volume on the local data storage system. Data flow associated with this volume MD synchronization processing is represented by proceeding (504 h) from the shadow volume object 542 to the UID switch 522, where the UID WWN2=16 is mapped by the UID switch 522 to the UID WWN1=15 of the regular volume. From the switch 522, the REG3 command with the UID WWN1=15 of the regular volume is transmitted (504 i) to the local to remote connection client 524, over the connection 530, to the local to remote connection server 526, and then processed (504 d) by the regular volume object 541.

As noted above, in simulation mode the connection is configured as a loopback connection from the system 530 to itself (e.g., the source and destination of the connection are both the same system 530). The management command to synchronize or update the copy of the volume MD associated with the regular volume is issued (504 i) to the local to remote connection client 524 over the connection 530 to the local to remote connection server 526, where both 524 and 526 are in the same system 530. The command REG3 currently specifies the UID WWN1=15 of the normal volume 541. As denoted by the processing arrow 504 d, processing proceeds from the connection server 526 to the regular volume object 541. As denoted by the processing arrow 504 b, the WWN1=15 of the regular volume is used as an index to identify the record 652 a to be updated to include the registration information of the row 663 of the table 660. Processing of the management path then provides for returning along the return path 504 b, 541, 504 d, 526, 530, 524, 504 i, 522, 504 h, 542, 504 f, 520, 504 e and 534 a to the host 540.

In a manner similar to that as described above for processing the third registration command REG3, processing may also be performed at a fourth subsequent point in time T4 to process a fourth registration command, REG4, issued by the host 540 to register its key K1 on the path 534 b for the stretched volume with the UID WWN1=15. As a result of servicing the management command REG4, the registration information in the row 674 of the table 670 of FIG. 9B may be populated, and as a result of simulating volume MD synchronization over the connection 530, the registration information in the row 664 of the table 660 of FIG. 9B may be populated.
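The four registrations REG1-REG4 and the simulated synchronization over the loopback connection can be sketched as follows. The names `mgt_db`, `PEER` and `register` are hypothetical, each record is reduced to a mapping from path to registered key, and the connection client, connection server and UID switch 522 are collapsed into a single lookup of the counterpart UID, which is a simplification of the processing flow of FIG. 8.

```python
REGULAR_WWN, SHADOW_WWN = 15, 16
PEER = {REGULAR_WWN: SHADOW_WWN, SHADOW_WWN: REGULAR_WWN}  # UID switch mapping
REMOTE_PORTS = {"T3", "T4"}  # assumed to simulate paths to the remote system

# Simplified records 652 a and 652 b: WWN -> {path: registered key}.
mgt_db = {REGULAR_WWN: {}, SHADOW_WWN: {}}


def register(initiator: str, target_port: str, host_wwn: int, key: str) -> None:
    """Service a registration on the receiving role, then replicate it to the
    counterpart volume over the simulated (loopback) connection."""
    path = f"{initiator}-{target_port}"
    # Commands received at a 'remote' port are first switched to the shadow UID.
    wwn = PEER[host_wwn] if target_port in REMOTE_PORTS else host_wwn
    mgt_db[wwn][path] = key        # update the record of the receiving role
    mgt_db[PEER[wwn]][path] = key  # loopback delivery to the counterpart record


# REG1-REG4: the host registers K1 over all 4 paths using WWN1=15.
for initiator, port in [("I1", "T1"), ("I2", "T2"), ("I1", "T3"), ("I2", "T4")]:
    register(initiator, port, REGULAR_WWN, "K1")
```

After the four commands, both records in this sketch hold the same four path registrations, matching the populated rows of the tables 660 and 670.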

Assume now at a fifth point in time T5 subsequent to T4, a reservation command RES1 is issued by the host 540 over the path 532 a for the stretched volume with the UID WWN1=15. The command RES1 may, for example, request write exclusive access for the initiator I1, whereby only the initiator I1 holding the current reservation is allowed to issue writes but other initiators may issue read I/Os. The RES1 command may be serviced using a processing flow similar to that discussed above, for example, in connection with processing the first registration command REG1 with respect to the processing flow described in connection with FIG. 8. As a result of servicing the RES1 command, a reservation is made as denoted by the element 661 b of the table 660 and the element 671 b of the table 670.

As illustrated by the volume MD in FIG. 9B, after synchronization of the volume MD of the regular volume and the shadow volume, both entries 652 a-b contain the same information but for different UIDs or WWNs for the two volumes.

At a sixth point in time T6 subsequent to T5, assume the host 540 issues a management command directed to the WWN1=15 of the simulated stretched volume to read existing reservation and registration information, where the command may be sent over any of the paths 532 a-b, 534 a-b over which the stretched volume is exposed. Consistent with discussion above regarding the RTPG command, if the management command to read existing registrations and reservations is sent over any of the paths 532 a-b, the volume MD 660 for the regular volume is returned to the host (e.g., data processing flow with reference to FIG. 8 is 532 a or 532 b, 510, 504 a, 541 and 504 b to obtain the requested information from the MGT DB 516; and a return path processing flow that is the reverse traversal of the foregoing data processing flow). Consistent with discussion above regarding the RTPG command, if the management command to read existing registrations and reservations is sent over any of the paths 534 a-b, the volume MD 670 for the shadow volume is returned to the host (e.g., data processing flow with reference to FIG. 8 is 534 a or 534 b, 512, 504 e, 520, 504 f, 542 and 504 g to obtain the requested information from the MGT DB 516; and a return path processing flow that is the reverse traversal of the foregoing data processing flow). Additionally, the information returned in response to the RTPG command directed to the simulated stretched volume is the same independent of the particular path (exposing the stretched volume) that is used to send the RTPG to the system 530.

In at least one embodiment with reference to FIG. 8, all I/O commands, such as read and write I/O commands, may be processed by the normal volume 541 independent of which target port receives the I/O command. To illustrate, I/Os sent over the paths 532 a-b are received at the TPG 510. As denoted by the arrow 502 a, the I/Os are sent from the TPG 510 to the regular volume 541 and then serviced (502 b) using the data from the BE data storage 514. In connection with the I/O commands, any data may be returned along with an acknowledgement using a return path that is the reverse traversal of the incoming I/O command path just described.

Additionally, I/Os sent over the paths 534 a-b are received at the TPG 512. As denoted by the arrow 502 c, the I/Os are sent from the TPG 512 to the regular volume 541 and then serviced (502 b) using the data from the BE data storage 514. In connection with the I/O commands, any data may be returned along with an acknowledgement using a return path that is the reverse traversal of the incoming I/O command path just described.
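The I/O routing just described, where every read or write is serviced by the regular volume regardless of the receiving target port group, may be sketched as follows. The function name, the dictionary standing in for the BE data storage, and the use of logical block addresses as keys are assumptions made only for illustration.

```python
# Target ports T1-T2 simulate paths to the local system, T3-T4 to the remote one.
TPG1 = {"T1", "T2"}
TPG2 = {"T3", "T4"}

# Hypothetical stand-in for the BE data storage backing the regular volume.
backend = {}


def service_io(target_port, op, lba, data=None):
    """Service an I/O on the regular volume no matter which port received it."""
    if target_port not in TPG1 | TPG2:
        raise ValueError(f"unknown target port {target_port}")
    # Both port groups forward I/O to the same regular volume, so there is
    # no branch on the port group here.
    if op == "write":
        backend[lba] = data
        return "ack"
    if op == "read":
        return backend.get(lba)
    raise ValueError(f"unsupported I/O operation {op}")


service_io("T1", "write", 0, b"payload")   # written over a simulated local path
print(service_io("T3", "read", 0))          # read back over a simulated remote path
```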

As described above such as in connection with FIG. 9B, the techniques herein provide for simulating MD synchronization via the two MD records or entries for the regular volume and its shadow volume. The techniques described above provide for generally simulating a stretched volume and processing of I/O or data path commands as well as control or management commands directed to a simulated stretched volume.

Testing may include validating the correctness of the MD stored in the entries or records of the MGT DB for the regular volume and its associated shadow volume. The validation regarding the correctness of the synchronized MD entries for the volume and its associated shadow volume may be performed using any suitable technique. For example, management commands may be issued from the host along the paths 532 a and 534 a to respectively read the MD of the two entries, 652 a and 652 b, and then compare the information returned to ensure that both commands report the same information. For example, a first command to read registration and reservation information for the stretched volume with the UID of WWN1=15 may be issued to T1 or T2 of the TPG1 510 to return a first set of information; and a second command to read the registration and reservation information for the stretched volume with the UID WWN1=15 may be issued to T3 or T4 of the TPG2 512 to return a second set of information. The first and the second sets of information may be compared to ensure that they match, whereby both the first and second commands return the same registration and reservation information denoting the consistency and synchronization of the MD entries for the regular volume V1 and shadow volume V2 of the simulated stretched volume.

Another way in which the MD of the two entries for the regular and shadow volumes of the DB may be validated is by reading the two entries, such as 652 a-b, directly from the MGT DB on the data storage system and comparing the content of the two entries. The two entries are expected to be identical but for the different WWNs for the regular volume and its shadow volume. If the two entries include the same information other than the different WWNs, then validation is successful.
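The direct comparison of the two MGT DB entries may be sketched as below. The record layout, the field name "wwn" and the shadow WWN value 16 are hypothetical; the only behavior taken from the description is that the entries are expected to match in everything except the WWN.

```python
def entries_consistent(rec_a, rec_b, id_field="wwn"):
    """Return True if two MD records differ only in their identifier field."""

    def without_id(rec):
        return {k: v for k, v in rec.items() if k != id_field}

    # The identifiers must differ, and all remaining metadata must match.
    return rec_a[id_field] != rec_b[id_field] and without_id(rec_a) == without_id(rec_b)


# Hypothetical contents for the regular and shadow volume entries.
regular = {"wwn": 15, "registrations": [("I1", "T1", "K1")], "reservation": "I1"}
shadow = {"wwn": 16, "registrations": [("I1", "T1", "K1")], "reservation": "I1"}
print(entries_consistent(regular, shadow))
```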

In at least one embodiment, the connection 530 used for volume MD synchronization and simulating MD synchronization may be a socket-based connection. In at least one embodiment, any message sent may be characterized as being sent from a local client to a remote server. In this context, either the simulated local system or the simulated remote system can be the client or the server depending on which simulated system sends the message over the connection. In at least one embodiment, the connection 530 may be a TCP connection between a client and a server. In simulation mode, the connection has a source IP address of the client and a destination IP address of the server, where the source IP address and the destination IP address are the same, or more generally, both on the same single data storage system. In contrast, in an actual metro configuration (e.g., non-simulation mode) where the TCP connection is used to synchronize MD between a local data storage system and a remote data storage system, the connection has a source IP address of the client and a destination IP address of the server, where the source IP address and the destination IP address are different and respectively denote the IP addresses of the first client data storage system sending the message and the second server data storage system receiving the message.
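A loopback TCP connection of the kind described, where the client and the server reside on the same system and the source and destination IP addresses coincide, may be sketched with standard sockets. The function name, the use of the address 127.0.0.1, the message contents and the "ACK" reply are all assumptions for illustration and do not reflect any particular wire protocol.

```python
import socket
import threading


def loopback_roundtrip(message):
    """Send one message from a client to a server on the same host over TCP."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []

    def serve():
        conn, _ = server.accept()
        with conn:
            received.append(conn.recv(1024))
            conn.sendall(b"ACK")

    worker = threading.Thread(target=serve)
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        src_ip = client.getsockname()[0]   # source IP address of the client
        dst_ip = client.getpeername()[0]   # destination IP address of the server
        client.sendall(message)
        reply = client.recv(1024)
    worker.join()
    server.close()
    return src_ip, dst_ip, received[0], reply


src, dst, got, reply = loopback_roundtrip(b"SYNC-MD")
print(src == dst, got, reply)
```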

The foregoing description provides examples of simulated stretched volume configurations including a regular volume and a single shadow volume. More generally, the techniques described herein may be applied in connection with a simulated stretched volume configuration of M volumes where M is an integer that is two or more. It is straightforward to extend the techniques herein for use with more than two volumes configured as a simulated stretched volume to simulate, for example, a stretched volume configured from 3 volumes as illustrated in FIG. 7C. For example, with 3 volumes included in 3 different data storage systems, the simulated stretched volume configuration includes a regular volume and two shadow volumes configured in the same single data storage system. With 3 volumes, the target ports (or more generally targets) of the single system 530 may be partitioned into 3 groups or sets where a different one of the groups or sets is associated with simulating paths to a different one of the 3 data storage systems. Each of the regular volume and the 2 shadow volumes is assigned a different LUN ID or UID, such as a different WWN. In this case, UID switches map the UIDs of the shadow volumes to the UID of the regular volume, and also map the UID of the regular volume to the UIDs of the shadow volumes. For example, with reference to FIG. 8, the UID switches 520, 522 map between the UID of the regular volume and the UID of the shadow volume in connection with processing management requests received at the TPG2 512. In a similar manner, additional UID switches may provide for mapping between the UID of the regular volume and the UID of a second shadow volume exposed over a third group of target ports of the system 530.
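The UID switching for a configuration of 3 volumes may be sketched as a pair of lookups. The group names, the shadow UIDs 16 and 17, and the function names are hypothetical; only the regular volume UID of 15 and the direction of each mapping follow the description.

```python
REGULAR_UID = 15  # UID under which the host addresses the stretched volume

# Hypothetical UID assigned to the volume behind each target port group.
GROUP_UIDS = {"TPG1": 15, "TPG2": 16, "TPG3": 17}
KNOWN_UIDS = set(GROUP_UIDS.values())


def switch_inbound(group, exposed_uid):
    """Map the host-visible UID to the UID of the volume simulated by the group."""
    if exposed_uid != REGULAR_UID:
        raise KeyError(f"UID {exposed_uid} is not the exposed stretched volume UID")
    return GROUP_UIDS[group]


def switch_outbound(internal_uid):
    """Map a regular or shadow volume UID back to the host-visible UID."""
    if internal_uid not in KNOWN_UIDS:
        raise KeyError(f"UID {internal_uid} is not part of the stretched volume")
    return REGULAR_UID


print(switch_inbound("TPG3", 15), switch_outbound(17))
```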

Referring to FIGS. 10A-10C, shown are flowcharts of processing steps that may be performed in an embodiment in accordance with the techniques herein. The flowcharts 700, 740 and 760 summarize processing described above.

At the step 702, a command may be issued to create a volume or LUN that is a simulated stretched volume configured using a volume pair V1, V2. V1 may be a normal or regular volume exposed to the host. V2 may be configured as a shadow volume of V1. V2 may be a simulated remote instance of the stretched volume or LUN. V1 and V2 may be configured to have the same logical unit number but may be configured to each have different LUN IDs such as different WWNs. The regular and shadow volumes are configured in the same single data storage system. From the step 702, control proceeds to the step 704.

At the step 704, processing may be performed to create a first record or entry in the MGT DB for V1 and a separate second record or entry in the MGT DB for V2. The first and second records may contain the same information other than different LUN IDs, such as different WWNs, uniquely used to identify different volume or LUN instances of the regular and shadow volumes. The first record represents the stretched volume MD on the local data storage system for V1, and the second record represents the stretched volume MD on the remote data storage system for V2. A regular or normal volume object represents V1 and the first record denoting, respectively, the stretched volume and its MD when the data storage system simulates the role of the local data storage system. A shadow volume object represents V2 and the second record denoting, respectively, the remote counterpart of the stretched volume and its MD when the data storage system simulates the role of the remote data storage system. The LUN ID, such as the WWN, of V2 may be configured based on specified rules of a defined conversion or mapping process that maps a first WWN1 of V1 to a second WWN2 of V2. In a similar manner, WWN1 of V1 may be determined from WWN2 of V2 based on specified rules of a defined conversion or mapping process that maps the second WWN2 of V2 to the first WWN1 of V1. From the step 704, control proceeds to the step 706.
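The step 704 can be sketched with one possible reversible mapping between WWN1 and WWN2. The description does not specify the conversion rules, so the rule below (setting a reserved high bit to mark the shadow volume) is purely an assumption, as are the function names, the record layout and the logical unit number value of 5.

```python
import copy

SHADOW_FLAG = 1 << 63  # hypothetical marker bit; the actual rule is not specified


def to_shadow_wwn(wwn1):
    """Derive the shadow volume WWN2 from the regular volume WWN1."""
    if wwn1 & SHADOW_FLAG:
        raise ValueError("WWN already denotes a shadow volume")
    return wwn1 | SHADOW_FLAG


def to_regular_wwn(wwn2):
    """Recover the regular volume WWN1 from the shadow volume WWN2."""
    if not wwn2 & SHADOW_FLAG:
        raise ValueError("WWN does not denote a shadow volume")
    return wwn2 & ~SHADOW_FLAG


def create_records(wwn1, shared_md):
    """Create separate MGT DB records for V1 and V2 with identical metadata."""
    wwn2 = to_shadow_wwn(wwn1)
    # Deep copies keep the two records independent, as two systems would be.
    return {
        wwn1: {"wwn": wwn1, **copy.deepcopy(shared_md)},
        wwn2: {"wwn": wwn2, **copy.deepcopy(shared_md)},
    }


mgt_db = create_records(15, {"lun": 5, "registrations": []})
print(sorted(mgt_db) == [15, to_shadow_wwn(15)])
```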

At the step 706, data storage system targets, such as target ports, over which the stretched volume is exposed are partitioned into two groups. The two groups may be non-overlapping having no intersection or common target ports in both groups. A first group 1 of the target ports, TPG1, may be associated with simulating the local data storage system and paths between the host and the local data storage system. If the host issues a management command to a target port in TPG1, the command is considered as a command to the local data storage system and is processed using the normal volume object. A second group 2 of the target ports, TPG2, may be associated with simulating the remote data storage system and paths between the host and the remote data storage system. If the host issues a management command to a target port in TPG2, the command is considered as a command to the remote data storage system and is processed using the shadow volume object. From the step 706, control proceeds to the step 708.
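The partitioning of the step 706 may be sketched as two disjoint sets with a lookup that yields the simulated role. The function names and the role labels "local" and "remote" are hypothetical; the port names T1 through T4 follow the earlier example.

```python
def make_partition(tpg1, tpg2):
    """Build a port-to-role map, rejecting overlapping target port groups."""
    tpg1, tpg2 = set(tpg1), set(tpg2)
    if tpg1 & tpg2:
        raise ValueError(f"groups overlap on {sorted(tpg1 & tpg2)}")
    roles = {port: "local" for port in tpg1}
    roles.update({port: "remote" for port in tpg2})
    return roles


def simulated_role(roles, target_port):
    """Return which data storage system the receiving target port simulates."""
    return roles[target_port]


roles = make_partition({"T1", "T2"}, {"T3", "T4"})
print(simulated_role(roles, "T1"), simulated_role(roles, "T4"))
```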

At the step 708, an I/O command directed to the stretched volume may be issued over a path from the host to the data storage system. The path may be any path to a target port of the TPG1 or the TPG2, where the I/O command is serviced using the regular volume. An acknowledgement or response may be returned to the host along with any requested data. The I/O command may be directed to the UID, such as the WWN, of the regular volume. From the step 708, control proceeds to the step 710.

At the step 710, a management command directed to the stretched volume may be issued over a path from the host to the data storage system. The management command may identify the stretched volume using the WWN1 or other UID of the regular volume. From the step 710, control proceeds to the step 712.

At the step 712, a determination is made as to whether the path over which the management command was sent is to a target port included in the TPG1. If the step 712 evaluates to yes, control proceeds to the step 714.

At the step 714, the data storage system simulates the local data storage system. The first record of MD for the regular volume is updated and/or otherwise used to service the management command. From the step 714, control proceeds to the step 716.

At the step 716, if the management command modifies or updates the regular volume MD in the first record of the MGT DB, processing is performed to simulate replicating the management command to the remote data storage system. With a stretched volume, a connection between the local and remote systems is used to transmit commands and data in order to synchronize volume MD. However, in simulation mode when simulating the stretched volume, the connection is configured from the single data storage system to itself and may be characterized as a loopback. The UID switching logic or component is used to map the WWN1 or LUN ID of the regular volume to the corresponding WWN2 or LUN ID of its shadow volume. The second record of volume MD for the shadow volume may be accordingly updated based on the management command. From the step 716, control proceeds to the step 718.

At the step 718, an acknowledgement or response may be returned to the host along with any requested information.

If the step 712 evaluates to no, the path is to a target port in the TPG2 and control proceeds to the step 720. At the step 720, the single data storage system in which the regular and shadow volumes are configured simulates the remote data storage system. A UID switching logic or component is used to map the WWN1 or LUN ID of the regular volume to a corresponding WWN2 or LUN ID of its shadow volume. The second record of MD for the shadow volume is updated or otherwise used to service the management command. From the step 720, control proceeds to the step 722.

At the step 722, if the management command modifies or updates the shadow volume MD in the MGT DB, processing is performed to simulate replicating the management command to the local data storage system. With a stretched volume, a connection between the local and remote systems is used to transmit commands and data in order to synchronize volume MD. However, in simulation mode when simulating the stretched volume, the connection is configured from the data storage system to itself and may be characterized as a loopback. The UID switching logic or component is used to map the WWN2 or LUN ID of the shadow volume to the corresponding WWN1 or LUN ID of its regular volume. The first record of volume MD for the regular volume may be accordingly updated based on the management command. From the step 722, control proceeds to the step 724.

At the step 724, an acknowledgement or response may be returned to the host along with any requested information.
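The management command flow of the steps 710 through 724 may be summarized in a minimal sketch. The class, its method names, the command names, the in-memory dictionary standing in for the MGT DB, and the shadow WWN value 16 are all hypothetical; the loopback connection is reduced to a direct method call on the same object.

```python
import copy


class SimulatedStretchedVolume:
    """Hypothetical single-system model of a regular volume and its shadow."""

    def __init__(self, wwn1, wwn2, tpg1, tpg2):
        self.wwn1, self.wwn2 = wwn1, wwn2
        self.tpg1, self.tpg2 = set(tpg1), set(tpg2)
        # One MD record per volume, keyed by that volume's own WWN.
        self.mgt_db = {w: {"registrations": set(), "reservation": None}
                       for w in (wwn1, wwn2)}

    def _apply(self, wwn, command, payload):
        record = self.mgt_db[wwn]
        if command == "register":
            record["registrations"].add(payload)
        elif command == "reserve":
            record["reservation"] = payload
        else:
            raise ValueError(f"unsupported command {command}")

    def _loopback_replicate(self, peer_wwn, command, payload):
        # Stand-in for sending the command over the connection to itself.
        self._apply(peer_wwn, command, payload)

    def handle(self, target_port, wwn, command, payload=None):
        if wwn != self.wwn1:
            raise KeyError("the host addresses the stretched volume by WWN1")
        if target_port in self.tpg1:        # steps 714-718: simulate local system
            own, peer = self.wwn1, self.wwn2
        elif target_port in self.tpg2:      # steps 720-724: UID switch to shadow
            own, peer = self.wwn2, self.wwn1
        else:
            raise ValueError(f"unknown target port {target_port}")
        if command == "read":
            return copy.deepcopy(self.mgt_db[own])
        self._apply(own, command, payload)
        self._loopback_replicate(peer, command, payload)
        return "ack"


vol = SimulatedStretchedVolume(15, 16, {"T1", "T2"}, {"T3", "T4"})
vol.handle("T3", 15, "register", ("I1", "T3", "K1"))
print(vol.handle("T1", 15, "read") == vol.handle("T4", 15, "read"))
```

Because every modifying command is applied to both records, a read issued over a path in either group reports the same registration and reservation information, which matches the consistency check described for testing.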

The techniques herein may be performed by any suitable hardware and/or software. For example, techniques herein may be performed by executing code which is stored on any one or more different forms of computer-readable media, where the code may be executed by one or more processors, for example, such as processors of a computer or other system, an ASIC (application specific integrated circuit), and the like. Computer-readable media may include different forms of volatile (e.g., RAM) and non-volatile (e.g., ROM, flash memory, magnetic or optical disks, or tape) storage which may be removable or non-removable.

While the invention has been disclosed in connection with embodiments shown and described in detail, their modifications and improvements thereon will become readily apparent to those skilled in the art. Accordingly, the spirit and scope of the present invention should be limited only by the following claims.

What is claimed is:
 1. A method of processing management commands comprising: creating a simulated stretched volume in a single data storage system, wherein the simulated stretched volume simulates a stretched volume configured from two or more volumes in two or more data storage systems with the two or more volumes exposed to a host as a same volume having a same first unique identifier over two or more paths from the two or more data storage systems, wherein the simulated stretched volume is configured from a plurality of volumes of the single data storage system and the plurality of volumes are assigned a plurality of unique identifiers associated with the simulated stretched volume, and wherein the plurality of volumes configured as the simulated stretched volume are exposed to the host over a plurality of paths from the single data storage system as the same volume having the same first unique identifier, wherein the single data storage system includes sets of target ports, wherein each of the sets of target ports simulates paths to a different one of the two or more data storage systems; receiving, on a first path of the plurality of paths, a first management command directed to the simulated stretched volume configured as the same volume having the first unique identifier, wherein the first path is from an initiator of the host to a first target port of the single data storage system, wherein the first target port is included in a first set of the plurality of sets of target ports and the first set of target ports simulates paths to a first data storage system of the two or more data storage systems; and performing first processing to service the first management command, wherein the first processing includes the single data storage system simulating the first data storage system servicing the first management command.
 2. The method of claim 1, wherein the single data storage system includes a management database with a plurality of metadata records for the plurality of volumes, wherein each of the plurality of volumes is described by metadata of a different one of the plurality of metadata records, and wherein each of the plurality of metadata records associated with a particular one of the plurality of volumes includes a same set of metadata describing the simulated stretched volume and includes one of the plurality of unique identifiers associated with the particular one of the plurality of volumes.
 3. The method of claim 2, wherein a first volume of the plurality of volumes in the single data storage system represents and simulates a particular one of the two or more volumes, wherein the particular one volume is included in the first data storage system, and wherein the first processing includes servicing the first management command using one of the plurality of metadata records associated with the first volume.
 4. The method of claim 3, wherein a second set of target ports is included in the plurality of sets of target ports of the single data storage system, wherein the second set of target ports simulates paths to a second data storage system of the two or more data storage systems, and wherein a second volume of the plurality of volumes in the single data storage system represents and simulates another one of the two or more volumes, wherein the another one volume is included in the second data storage system.
 5. The method of claim 4, wherein the first volume is a regular volume configured in the single data storage system with the first unique identifier, wherein the second volume is a shadow volume of the regular volume, and wherein the shadow volume is configured with a second unique identifier of the plurality of unique identifiers.
 6. The method of claim 5, wherein the first processing includes: using the first unique identifier to access a first set of metadata of a first of the plurality of metadata records associated with the regular volume.
 7. The method of claim 6, wherein servicing the first management command includes: reading the first set of metadata associated with the first unique identifier; and returning a portion of the first set of metadata in accordance with the first management command.
 8. The method of claim 6, wherein servicing the first management command includes: updating, in accordance with the first management command, the first set of metadata of the first metadata record associated with the first unique identifier and the regular volume; and simulating replicating the first management command over a connection to the second data storage system.
 9. The method of claim 8, wherein the connection is configured for a simulation mode that simulates the stretched volume and wherein the connection is configured from the single data system to the single data storage system.
 10. The method of claim 9, wherein said simulating replicating the first management command over the connection to second data storage system includes: transmitting the first management command over the connection configured for the simulation mode; mapping the first unique identifier to the second unique identifier; and updating, in accordance with the first management command, a second set of metadata of a second of the plurality of metadata records associated with the second unique identifier and the shadow volume.
 11. The method of claim 10, further comprising: receiving, over a second path of the plurality of paths, a second management command directed to the simulated stretched volume configured as the same volume having the first unique identifier, wherein the second path is from an initiator of the host to a second target port of the single data storage system, wherein the second target port is included in the second set of the plurality of sets of target ports that simulates paths to the second data storage system; and performing second processing to service the second management command, wherein the second processing includes the single data storage system simulating the second data storage system servicing the second management command.
 12. The method of claim 11, wherein the second processing includes: mapping the first unique identifier associated with the simulated stretched volume to the second unique identifier associated with the simulated stretched volume; and using the second unique identifier to access the second set of metadata of the second metadata record associated with the shadow volume.
 13. The method of claim 12, wherein servicing the second management command includes: reading the second set of metadata of the second metadata record associated with the second identifier and the shadow volume; and returning a portion of the second set of metadata in accordance with the second management command.
 14. The method of claim 12, wherein servicing the second management command includes: updating, in accordance with the second management command, the second set of metadata of the second metadata record associated with the second identifier and the shadow volume; and simulating replicating the second management command over the connection to the first data storage system.
 15. The method of claim 14, wherein said simulating replicating the second management command over the connection to the first data storage system includes: mapping the second unique identifier to the first unique identifier; transmitting the second management command over the connection configured for the simulation mode, wherein the second management command is directed to the regular volume having the first unique identifier; and updating, in accordance with the second management command, the first set of metadata of the first metadata record associated with the first unique identifier and the regular volume.
 16. The method of claim 11, further comprising: receiving a first I/O command on the first path from the host to the single data storage system, wherein the first I/O command is directed to the simulated stretched volume configured as the same volume having the first unique identifier; and servicing the first I/O command using the regular volume.
 17. The method of claim 16, further comprising: receiving a second I/O command on the second path from the host to the single data storage system, wherein the I/O command is directed to the simulated stretched volume configured as the same volume having the first unique identifier; and servicing the second I/O command using the regular volume.
 18. A system comprising: one or more processors; and one or more memories comprising code stored thereon that, when executed, performs a method of processing management commands comprising: creating a simulated stretched volume in a single data storage system, wherein the simulated stretched volume simulates a stretched volume configured from two or more volumes in two or more data storage systems with the two or more volumes exposed to a host as a same volume having a same first unique identifier over two or more paths from the two or more data storage systems, wherein the simulated stretched volume is configured from a plurality of volumes of the single data storage system and the plurality of volumes are assigned a plurality of unique identifiers associated with the simulated stretched volume, and wherein the plurality of volumes configured as the simulated stretched volume are exposed to the host over a plurality of paths from the single data storage system as the same volume having the same first unique identifier, wherein the single data storage system includes sets of target ports, wherein each of the sets of target ports simulates paths to a different one of the two or more data storage systems; receiving, on a first path of the plurality of paths, a first management command directed to the simulated stretched volume configured as the same volume having the first unique identifier, wherein the first path is from an initiator of the host to a first target port of the single data storage system, wherein the first target port is included in a first set of the plurality of sets of target ports and the first set of target ports simulates paths to a first data storage system of the two or more data storage systems; and performing first processing to service the first management command, wherein the first processing includes the single data storage system simulating the first data storage system servicing the first management command.
 19. A non-transitory computer readable medium comprising code stored thereon that, when executed, performs a method of processing management commands comprising: creating a simulated stretched volume in a single data storage system, wherein the simulated stretched volume simulates a stretched volume configured from two or more volumes in two or more data storage systems with the two or more volumes exposed to a host as a same volume having a same first unique identifier over two or more paths from the two or more data storage systems, wherein the simulated stretched volume is configured from a plurality of volumes of the single data storage system and the plurality of volumes are assigned a plurality of unique identifiers associated with the simulated stretched volume, and wherein the plurality of volumes configured as the simulated stretched volume are exposed to the host over a plurality of paths from the single data storage system as the same volume having the same first unique identifier, wherein the single data storage system includes sets of target ports, wherein each of the sets of target ports simulates paths to a different one of the two or more data storage systems; receiving, on a first path of the plurality of paths, a first management command directed to the simulated stretched volume configured as the same volume having the first unique identifier, wherein the first path is from an initiator of the host to a first target port of the single data storage system, wherein the first target port is included in a first set of the plurality of sets of target ports and the first set of target ports simulates paths to a first data storage system of the two or more data storage systems; and performing first processing to service the first management command, wherein the first processing includes the single data storage system simulating the first data storage system servicing the first management command.